Abstract

This paper deals with the problem of density estimation. We aim at building an estimate
of an unknown density as a linear combination of functions of a dictionary. Inspired by
Candès and Tao's approach, we propose an $\ell_1$-minimization under an adaptive Dantzig
constraint coming from sharp concentration inequalities. This allows us to consider a wide
class of dictionaries. Under local or global coherence assumptions, oracle inequalities are 
derived. These theoretical results are also proved to be valid for the natural Lasso estimate 
associated with our Dantzig procedure. Then, the issue of calibrating these procedures is 
studied from both theoretical and practical points of view. Finally, a numerical study shows 
the significant improvement obtained by our procedures when compared with other classical 
procedures.

1 Introduction

Various estimation procedures based on $\ell_1$ penalization (exemplified by the Dantzig procedure in
[13] and the LASSO procedure in [28]) have been extensively studied recently. These procedures
are computationally efficient, as shown in [17], [24], [25], and are thus well suited to high-dimensional
data. They have been widely used in regression models, but only the Lasso estimator has been
studied in the density model (see [3], [10], [29]). Although we will mostly consider the Dantzig
estimator in the density model, for which no result exists so far, we recall some of the classical
results obtained in different settings by procedures based on $\ell_1$ penalization.

The Dantzig selector was introduced by Candès and Tao [13] in the linear regression
model. More precisely, given

$$Y = A\lambda_0 + \varepsilon,$$

where $Y \in \mathbb{R}^n$, $A$ is an $n \times M$ matrix, $\varepsilon \in \mathbb{R}^n$ is the noise vector and $\lambda_0 \in \mathbb{R}^M$ is the unknown
regression parameter to estimate, the Dantzig estimator is defined by

$$\hat\lambda^D = \arg\min_{\lambda \in \mathbb{R}^M} \|\lambda\|_{\ell_1} \quad \text{subject to} \quad \|A^T(A\lambda - Y)\|_{\ell_\infty} \le \eta,$$

where $\|\cdot\|_{\ell_\infty}$ is the sup-norm in $\mathbb{R}^M$, $\|\cdot\|_{\ell_1}$ is the $\ell_1$ norm in $\mathbb{R}^M$, and $\eta$ is a regularization
parameter. A natural companion of this estimator is the Lasso procedure or, more precisely, its
relaxed form

$$\hat\lambda^L = \arg\min_{\lambda \in \mathbb{R}^M} \left\{ \frac{1}{2}\|A\lambda - Y\|_{\ell_2}^2 + \eta\|\lambda\|_{\ell_1} \right\},$$

where $\eta$ plays exactly the same role as for the Dantzig estimator. This $\ell_1$ penalized method
is also called basis pursuit in signal processing (see [14], [15]).

Candès and Tao [13] have obtained a bound for the $\ell_2$ risk of the estimator $\hat\lambda^D$, with large
probability, under a global condition on the matrix $A$ (the Restricted Isometry Property) and a
sparsity assumption on $\lambda_0$, even for $M > n$. Bickel et al. [3] have obtained oracle inequalities
and bounds of the $\ell_p$ loss for both estimators under weaker assumptions. Actually, Bickel et al.
[3] deal with the nonparametric regression framework in which one observes

$$Y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$

where $f$ is an unknown function while $(x_i)_{i=1,\ldots,n}$ are known design points and $(\varepsilon_i)_{i=1,\ldots,n}$ is a
noise vector. There is no intrinsic matrix $A$ in this problem, but for any dictionary of functions
$\Upsilon = (\varphi_m)_{m=1,\ldots,M}$ one can search for $f$ as a weighted sum $f_\lambda$ of elements of $\Upsilon$,

$$f_\lambda = \sum_{m=1}^M \lambda_m \varphi_m,$$

and introduce the matrix $A = (\varphi_m(x_i))_{i,m}$, which summarizes the information on the dictionary
and on the design. Notice that if there exists $\lambda_0$ such that $f = f_{\lambda_0}$, then the model can be
rewritten exactly as the classical linear model. However, if this is not the case and a model bias
exists, the Dantzig and Lasso procedures can still be applied under similar assumptions on
$A$. Oracle inequalities are obtained for which approximation theory plays an important role in

Let us also mention that in various settings, under various assumptions on the matrix $A$
(or more precisely on the associated Gram matrix $G = A^TA$), properties of these estimators
have been established for subset selection (see [11], [20], [22], [23], [30], [31]) and for prediction (see
[3], [11], [20], [23], [32]).

1.1 Our goals and results 

We consider in this paper the density estimation framework already studied for the Lasso estimate
by Bunea et al. [10], [11] and van de Geer [29]. Namely, our goal is to estimate $f_0$, an unknown density
function, by using the observations of an $n$-sample of variables $X_1, \ldots, X_n$ of density $f_0$. As in
the nonparametric regression setting, we introduce a dictionary of functions $\Upsilon = (\varphi_m)_{m=1,\ldots,M}$,
and search again for estimates of $f_0$ as linear combinations $f_\lambda$ of the dictionary functions. We rely
on the Gram matrix associated with $\Upsilon$ and on the empirical scalar products of $f_0$ with the $\varphi_m$'s,

$$\hat\beta_m = \frac{1}{n} \sum_{i=1}^n \varphi_m(X_i).$$

The Dantzig estimate $\hat f^D$ is then obtained by minimizing $\|\lambda\|_{\ell_1}$ over the set of parameters $\lambda$
satisfying the adaptive Dantzig constraint:

$$\forall m \in \{1, \ldots, M\}, \quad |(G\lambda)_m - \hat\beta_m| \le \eta_{\gamma,m},$$

where, for $m \in \{1, \ldots, M\}$, $(G\lambda)_m$ is the scalar product of $f_\lambda$ with $\varphi_m$,

$$\eta_{\gamma,m} = \sqrt{\frac{2\tilde\sigma_m^2 \gamma \log M}{n}} + \frac{2\|\varphi_m\|_\infty \gamma \log M}{3n},$$

$\tilde\sigma_m^2$ is a sharp estimate of the variance of $\hat\beta_m$ and $\gamma$ is a constant to be chosen. Section 2 gives
precise definitions and heuristics for using this constraint. We just mention here that $\eta_{\gamma,m}$ comes
from sharp concentration inequalities to give tight constraints. Our idea is that if $f_0$ can be
decomposed on $\Upsilon$ as

$$f_0 = \sum_{m=1}^M \lambda_{0,m} \varphi_m,$$

then we force the set of feasible parameters $\lambda$ to contain $\lambda_0$ with large probability and to be as
small as possible. Significant improvements in practice are expected.

Our goals in this paper are mainly twofold. First, we aim at establishing sharp oracle
inequalities under very mild assumptions on the dictionary. Our starting point is that most of the
papers in the literature assume that the functions of the dictionary are bounded by a constant
independent of $M$ and $n$, which constitutes a strong limitation, in particular for dictionaries
based on histograms or wavelets (see for instance [6], [7], [8], [9], [11] or [29]). Such assumptions
on the functions of $\Upsilon$ will not be considered in our paper. Likewise, our methodology does not
rely on the knowledge of $\|f_0\|_\infty$, which can even be infinite (as noticed by Birgé [4] for the study of
the integrated $L_2$-risk, most of the papers in the literature typically assume that the sup-norm
of the unknown density is finite with a known or estimated bound for this quantity). Finally, let
us mention that, in contrast with what Bunea et al. [10] did, we obtain oracle inequalities with
leading constant 1, and furthermore these are established under much weaker assumptions on
the dictionary than in [10].

The second goal of this paper deals with the problem of calibrating the so-called Dantzig
constant $\gamma$: how should this constant be chosen to obtain good results in both theory and
practice? Most of the time, for Lasso-type estimators, the regularization parameter is of the form
$a\sqrt{\frac{\log M}{n}}$ with $a$ a positive constant (see [3], [7], [6], [9], [11], [20] or [23] for instance). These
results are obtained with large probability that depends on the tuning coefficient $a$. In practice, it
is not simple to calibrate the constant $a$. Unfortunately, most of the time, the theoretical choice
of the regularization parameter is not suitable for practical issues. This fact is true for Lasso-type
estimates but also for many algorithms for which the regularization parameter provided by the
theory is often too conservative for practical purposes (see [18], which clearly explains and illustrates
this point for a thresholding procedure). So, one of the main goals of this paper is to fill the
gap between the optimal parameter choice provided by theoretical results on the one hand and
by a simulation study on the other hand. Only a few papers are devoted to this problem. In
the model selection setting, the issue of calibration has been addressed by Birgé and Massart
[5], who considered $\ell_0$-penalized estimators in a Gaussian homoscedastic regression framework
and showed that there exists a minimal penalty in the sense that taking smaller penalties leads
to inconsistent estimation procedures. Arlot and Massart [1] generalized these results for non-Gaussian
or heteroscedastic data and Reynaud-Bouret and Rivoirard [26] addressed this question
for thresholding rules in the Poisson intensity framework.

Now, let us describe our results. By using the previous data-driven Dantzig constraint, oracle
inequalities are derived under local conditions on the dictionary that are valid under classical
assumptions on the structure of the dictionary. We extensively discuss these assumptions and
we show their own interest in the context of the paper. Each term of these oracle inequalities is
easily interpretable. Classical results are recovered when we further assume:

$$\|\varphi_m\|_\infty^2 \le c_1 \left(\frac{n}{\log M}\right) \|f_0\|_\infty,$$

where $c_1$ is a constant. This assumption is very mild and, unlike in classical works, allows us to
consider dictionaries based on wavelets. Then, relying on our Dantzig estimate, we build an
adaptive Lasso procedure whose oracle performances are similar. This illustrates the closeness
between Lasso and Dantzig-type estimates.

Our results are proved for $\gamma > 1$. For the theoretical calibration issue, we study the
performance of our procedure when $\gamma < 1$. We show that in a simple framework, estimation of the
straightforward signal $f_0 = 1_{[0,1]}$ cannot be performed at a convenient rate of convergence when
$\gamma < 1$. This result proves that the assumption $\gamma > 1$ is thus not too conservative.

Finally, a simulation study illustrates how dictionary-based methods outperform classical
ones. More precisely, we show that our Dantzig and Lasso procedures with $\gamma > 1$, but close to 1,
outperform classical ones, such as simple histogram procedures, wavelet thresholding or Dantzig
procedures based on the knowledge of $\|f_0\|_\infty$ and less tight Dantzig constraints.

1.2 Outlines 

Section 2 introduces the density estimator of $f_0$ whose theoretical performances are studied in
Section 3. Section 4 studies the Lasso estimate proposed in this paper. The calibration issue is
studied in Section 5.1 and numerical experiments are performed in Section 5.2. Finally, Section
6 is devoted to the proofs of our results.



2 The Dantzig estimator of the density $f_0$

As said in the Introduction, our goal is to build an estimate of $f_0$ as a linear combination of
functions of $\Upsilon = (\varphi_m)_{m=1,\ldots,M}$, where we assume without any loss of generality that, for any $m$,
$\|\varphi_m\|_2 = 1$:

$$f_\lambda = \sum_{m=1}^M \lambda_m \varphi_m.$$

For this purpose, we rely on natural estimates of the $L_2$-scalar products between $f_0$
and the $\varphi_m$'s. So, for $m \in \{1, \ldots, M\}$, we set

$$\beta_{0,m} = \int \varphi_m(x) f_0(x)\,dx, \qquad (1)$$

and we consider its empirical counterpart

$$\hat\beta_m = \frac{1}{n} \sum_{i=1}^n \varphi_m(X_i), \qquad (2)$$

which is an unbiased estimate of $\beta_{0,m}$. The variance of this estimate is $\mathrm{Var}(\hat\beta_m) = \frac{\sigma_{0,m}^2}{n}$, where

$$\sigma_{0,m}^2 = \int \varphi_m^2(x) f_0(x)\,dx - \beta_{0,m}^2. \qquad (3)$$

Note also that for any $\lambda$ and any $m$, the $L_2$-scalar product between $f_\lambda$ and $\varphi_m$ can be easily
computed:

$$\int \varphi_m(x) f_\lambda(x)\,dx = \sum_{m'=1}^M \lambda_{m'} \int \varphi_{m'}(x)\varphi_m(x)\,dx = (G\lambda)_m,$$

where $G$ is the Gram matrix associated with the dictionary $\Upsilon$, defined for any $1 \le m, m' \le M$ by

$$G_{m,m'} = \int \varphi_m(x)\varphi_{m'}(x)\,dx.$$
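As a small numerical illustration of the Gram matrix, the following sketch approximates $G$ by a midpoint rule on $[0,1]$. The function name `gram_matrix`, the grid size and the three-element toy dictionary are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def gram_matrix(dictionary, n_grid=4096):
    """Approximate G[m][k] = integral over [0, 1] of phi_m * phi_k (midpoint rule)."""
    xs = [(i + 0.5) / n_grid for i in range(n_grid)]
    values = [[phi(x) for x in xs] for phi in dictionary]
    size = len(dictionary)
    return [[sum(a * b for a, b in zip(values[m], values[k])) / n_grid
             for k in range(size)]
            for m in range(size)]

# Hypothetical toy dictionary: each function has unit L2 norm on [0, 1].
def constant(x):
    return 1.0

def haar_mother(x):
    return 1.0 if x < 0.5 else -1.0

def linear(x):
    return math.sqrt(3.0) * x

G = gram_matrix([constant, haar_mother, linear])
```

The first two functions are orthogonal, so the corresponding entry of `G` vanishes, while the linear function is correlated with both, which is the situation where the Gram matrix differs from the identity.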



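Any reasonable choice of $\lambda$ should ensure that the coefficients $(G\lambda)_m$ are close to $\hat\beta_m$ for all $m$.
Therefore, using Candès and Tao's approach, we define the Dantzig constraint:

$$\forall m \in \{1, \ldots, M\}, \quad |(G\lambda)_m - \hat\beta_m| \le \eta_{\gamma,m} \qquad (4)$$

and the Dantzig estimate $\hat f^D$ by $\hat f^D = f_{\hat\lambda^{D,\gamma}}$ with

$$\hat\lambda^{D,\gamma} = \arg\min_{\lambda \in \mathbb{R}^M} \|\lambda\|_{\ell_1} \ \text{ such that } \lambda \text{ satisfies the Dantzig constraint (4)},$$

where, for $\gamma > 0$ and $m \in \{1, \ldots, M\}$,

$$\eta_{\gamma,m} = \sqrt{\frac{2\tilde\sigma_m^2 \gamma \log M}{n}} + \frac{2\|\varphi_m\|_\infty \gamma \log M}{3n}, \qquad (5)$$

with

$$\tilde\sigma_m^2 = \hat\sigma_m^2 + 2\|\varphi_m\|_\infty \sqrt{\frac{2\hat\sigma_m^2 \gamma \log M}{n}} + \frac{8\|\varphi_m\|_\infty^2 \gamma \log M}{n} \qquad (6)$$

and

$$\hat\sigma_m^2 = \frac{1}{n(n-1)} \sum_{i=2}^n \sum_{j=1}^{i-1} \left(\varphi_m(X_i) - \varphi_m(X_j)\right)^2. \qquad (7)$$

Note that $\eta_{\gamma,m}$ depends on the data, so the constraint (4) will be referred to as the adaptive Dantzig
constraint in the sequel. We now justify the introduction of the density estimate $\hat f^D$.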
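The data-driven threshold can be computed directly from the values $\varphi_m(X_i)$. The sketch below follows (2), (5), (6) and (7) as read here (with $3n$ in the denominator of the deterministic term, as in Bernstein's inequality); the function name `adaptive_threshold` and its argument names are hypothetical.

```python
import math

def adaptive_threshold(phi_values, sup_norm, dict_size, gamma=1.01):
    """Compute beta_hat, sigma_hat^2, sigma_tilde^2 and eta for one dictionary function.

    phi_values : list of phi_m(X_i) for i = 1..n (n >= 2)
    sup_norm   : sup-norm of phi_m
    dict_size  : number M of functions in the dictionary
    """
    n = len(phi_values)
    level = gamma * math.log(dict_size)          # gamma * log M
    beta_hat = sum(phi_values) / n               # empirical coefficient (2)
    # Pairwise estimate of the variance (7); O(n^2) on purpose, to mirror the formula.
    pair_sum = sum((phi_values[i] - phi_values[j]) ** 2
                   for i in range(1, n) for j in range(i))
    sigma_hat2 = pair_sum / (n * (n - 1))
    # Slight overestimate of the variance (6).
    sigma_tilde2 = (sigma_hat2
                    + 2.0 * sup_norm * math.sqrt(2.0 * sigma_hat2 * level / n)
                    + 8.0 * sup_norm ** 2 * level / n)
    # Threshold (5): random term plus deterministic term.
    eta = math.sqrt(2.0 * sigma_tilde2 * level / n) + 2.0 * sup_norm * level / (3.0 * n)
    return {"beta_hat": beta_hat, "sigma_hat2": sigma_hat2,
            "sigma_tilde2": sigma_tilde2, "eta": eta}
```

The pairwise sum in (7) coincides with the usual unbiased sample variance, so for large $n$ it can be replaced by a one-pass computation.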

The definition of $\eta_{\gamma,m}$ is based on the following heuristics. Given $m$, when there exists a
constant $c_0 > 0$ such that $f_0(x) \ge c_0$ for $x$ in the support of $\varphi_m$ satisfying $\|\varphi_m\|_\infty^2 = o_n(n(\log M)^{-1})$,
then, with large probability, the deterministic term of (5) is negligible with respect to the random
one. In this case, the random term is the main one and we asymptotically derive

$$\eta_{\gamma,m} \approx \sqrt{2\gamma \log M \,\frac{\tilde\sigma_m^2}{n}}. \qquad (8)$$

Having in mind that $\tilde\sigma_m^2/n$ is a convenient estimate for $\mathrm{Var}(\hat\beta_m)$ (see the proof of Theorem 1),
the shape of the right-hand term of formula (8) looks like the bound proposed by Candès and
Tao [13] to define the Dantzig constraint in the linear model. Actually, the deterministic term
of (5) allows us to get sharp concentration inequalities. As often done in the literature, instead of
estimating $\mathrm{Var}(\hat\beta_m)$, we could use the inequality

$$\mathrm{Var}(\hat\beta_m) = \frac{\sigma_{0,m}^2}{n} \le \frac{\|f_0\|_\infty}{n}$$

and we could replace $\tilde\sigma_m^2$ with $\|f_0\|_\infty$ in the definition of the $\eta_{\gamma,m}$. But this requires a strong
assumption: $f_0$ is bounded and $\|f_0\|_\infty$ is known. In our paper, $\mathrm{Var}(\hat\beta_m)$ is estimated, which allows us
not to impose these conditions. More precisely, we slightly overestimate $\sigma_{0,m}^2$ to control large
deviation terms and this is the reason why we introduce $\tilde\sigma_m^2$ instead of using $\hat\sigma_m^2$, an unbiased
estimate of $\sigma_{0,m}^2$. Finally, $\gamma$ is a constant that has to be suitably calibrated and plays a capital
role in practice.

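The following result justifies previous heuristics by showing that, if $\gamma > 1$, with high
probability, the quantity $|\hat\beta_m - \beta_{0,m}|$ is smaller than $\eta_{\gamma,m}$ for all $m$. The parameter $\eta_{\gamma,m}$ with $\gamma$ close
to 1 can be viewed as the "smallest" quantity that ensures this property.

Theorem 1. Let us assume that $M$ satisfies

$$n \le M \le \exp(n^\delta) \qquad (9)$$

for $\delta < 1$. Let $\gamma > 1$. Then, for any $\varepsilon > 0$, there exists a constant $C_1(\varepsilon,\delta,\gamma)$ depending on $\varepsilon$, $\delta$
and $\gamma$ such that

$$\mathbb{P}\left(\exists m \in \{1, \ldots, M\},\ |\beta_{0,m} - \hat\beta_m| \ge \eta_{\gamma,m}\right) \le C_1(\varepsilon,\delta,\gamma)\, M^{1-\frac{\gamma}{1+\varepsilon}}.$$

In addition, there exists a constant $C_2(\delta,\gamma)$ depending on $\delta$ and $\gamma$ such that

$$\mathbb{P}\left(\forall m \in \{1, \ldots, M\},\ \eta_{\gamma,m}^{(-)} \le \eta_{\gamma,m} \le \eta_{\gamma,m}^{(+)}\right) \ge 1 - C_2(\delta,\gamma)\, M^{1-\gamma},$$

where, for $m \in \{1, \ldots, M\}$,

$$\eta_{\gamma,m}^{(-)} = \sigma_{0,m}\sqrt{\frac{8\gamma \log M}{7n}} + \frac{2\|\varphi_m\|_\infty \gamma \log M}{3n}$$

and

$$\eta_{\gamma,m}^{(+)} = \sigma_{0,m}\sqrt{\frac{16\gamma \log M}{n}} + \frac{10\|\varphi_m\|_\infty \gamma \log M}{n}.$$

This result is proved in Section 6.1. The first part is a sharp concentration inequality proved
by using Bernstein type controls. The second part of the theorem proves that, up to constants
depending on $\gamma$, $\eta_{\gamma,m}$ is of order $\sigma_{0,m}\sqrt{\frac{\log M}{n}} + \|\varphi_m\|_\infty \frac{\log M}{n}$ with high probability. Note that the
assumption $\gamma > 1$ is essential for the bounds $C_1 M^{1-\frac{\gamma}{1+\varepsilon}}$ and $C_2 M^{1-\gamma}$ to go to 0.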
Finally, let $\lambda_0 = (\lambda_{0,m})_{m=1,\ldots,M} \in \mathbb{R}^M$ be such that

$$P_\Upsilon f_0 = \sum_{m=1}^M \lambda_{0,m}\varphi_m,$$

where $P_\Upsilon$ is the projection on the space spanned by $\Upsilon$. We have

$$(G\lambda_0)_m = \int (P_\Upsilon f_0)\varphi_m = \int f_0 \varphi_m = \beta_{0,m}.$$

So, Theorem 1 proves that $\lambda_0$ satisfies the adaptive Dantzig constraint (4) with probability larger
than $1 - C_1(\varepsilon,\delta,\gamma)M^{1-\frac{\gamma}{1+\varepsilon}}$ for any $\varepsilon > 0$. Actually, we force the set of parameters $\lambda$ satisfying the
adaptive Dantzig constraint to contain $\lambda_0$ with large probability and to be as small as possible.
Therefore, $\hat f^D = f_{\hat\lambda^{D,\gamma}}$ is a good candidate among sparse estimates linearly decomposed on $\Upsilon$
for estimating $f_0$.

We mention that Assumption (9) can be relaxed and we can take $M < n$ provided the
definition of $\eta_{\gamma,m}$ is modified.

3 Results for the Dantzig estimators

In the sequel, we will denote $\hat\lambda^D = \hat\lambda^{D,\gamma}$ to simplify the notations, but the Dantzig estimator $\hat f^D$
still depends on $\gamma$. Moreover, we assume that (9) is true and we denote by $\eta_\gamma =
(\eta_{\gamma,m})_{m=1,\ldots,M}$ the vector considered with the Dantzig constant $\gamma > 1$.



3.1 The main result under local assumptions 

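Let us state the main result of this paper. For any $J \subset \{1, \ldots, M\}$, we set $J^C = \{1, \ldots, M\} \setminus J$
and define $\lambda_J$ the vector which has the same coordinates as $\lambda$ on $J$ and zero coordinates on $J^C$.
We introduce a local assumption indexed by a subset $J_0$.

• Local Assumption Given $J_0 \subset \{1, \ldots, M\}$, for some constants $\kappa_{J_0} > 0$ and $\mu_{J_0} \ge 0$
depending on $J_0$, we have for any $\lambda$,

$$\|f_\lambda\|_2 \ge \kappa_{J_0}\|\lambda_{J_0}\|_{\ell_2} - \mu_{J_0}\left(\|\lambda_{J_0^C}\|_{\ell_1} - \|\lambda_{J_0}\|_{\ell_1}\right)_+. \qquad (LA(J_0, \kappa_{J_0}, \mu_{J_0}))$$

We obtain the following oracle type inequality without any assumption on $f_0$.

Theorem 2. Let $J_0 \subset \{1, \ldots, M\}$ be fixed. We suppose that $(LA(J_0, \kappa_{J_0}, \mu_{J_0}))$ holds. Then,
with probability at least $1 - C_1(\varepsilon,\delta,\gamma)M^{1-\frac{\gamma}{1+\varepsilon}}$, we have for any $\beta > 0$,

$$\|\hat f^D - f_0\|_2^2 \le \inf_{\lambda \in \mathbb{R}^M} \left\{ \|f_\lambda - f_0\|_2^2 + \beta\,\frac{\Lambda(\lambda, J_0^C)^2}{|J_0|}\left(1 + \frac{2\mu_{J_0}\sqrt{|J_0|}}{\kappa_{J_0}}\right)^2 + 16|J_0|\left(\frac{1}{\beta} + \frac{1}{\kappa_{J_0}^2}\right)\|\eta_\gamma\|_{\ell_\infty}^2 \right\} \qquad (10)$$

with

$$\Lambda(\lambda, J_0^C) = \|\lambda_{J_0^C}\|_{\ell_1} + \frac{\left(\|\hat\lambda^D\|_{\ell_1} - \|\lambda\|_{\ell_1}\right)_+}{2}.$$

Let us comment on each term of the right-hand side of (10). The first term is an approximation
term which measures the closeness between $f_0$ and $f_\lambda$. This term can vanish if $f_0$ can be
decomposed on the dictionary. The second term is a price to pay when either $\lambda$ is not supported
by the subset $J_0$ considered or it does not satisfy the condition $\|\hat\lambda^D\|_{\ell_1} \le \|\lambda\|_{\ell_1}$, which holds as
soon as $\lambda$ satisfies the adaptive Dantzig constraint. Finally, the last term, which does not depend
on $\lambda$, can be viewed as a variance term corresponding to the estimation on the subset $J_0$. Indeed,
remember that $\eta_{\gamma,m}$ relies on an estimate of the variance of $\hat\beta_m$. Furthermore, we have with high
probability:

$$\|\eta_\gamma\|_{\ell_\infty}^2 \le 2 \sup_m \left( \frac{16\sigma_{0,m}^2 \gamma \log M}{n} + \left(\frac{10\|\varphi_m\|_\infty \gamma \log M}{n}\right)^2 \right).$$

So, if $f_0$ is bounded then $\sigma_{0,m}^2 \le \|f_0\|_\infty$, and if there exists a constant $c_1$ such that for any $m$,

$$\|\varphi_m\|_\infty^2 \le c_1 \left(\frac{n}{\log M}\right)\|f_0\|_\infty \qquad (11)$$

(which is true for instance for a bounded dictionary), then

$$\|\eta_\gamma\|_{\ell_\infty}^2 \le C\|f_0\|_\infty \frac{\log M}{n}$$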

(where $C$ is a constant depending on $\gamma$ and $c_1$) and tends to 0 when $n$ goes to $\infty$. We thus
obtain the following result.

Corollary 1. Let $J_0 \subset \{1, \ldots, M\}$ be fixed. We suppose that $(LA(J_0, \kappa_{J_0}, \mu_{J_0}))$ holds. If (11)
is satisfied then, with probability at least $1 - C_1(\varepsilon,\delta,\gamma)M^{1-\frac{\gamma}{1+\varepsilon}}$, we have for any $\beta > 0$, for any
$\lambda$ that satisfies the adaptive Dantzig constraint,

$$\|\hat f^D - f_0\|_2^2 \le \|f_\lambda - f_0\|_2^2 + \beta C_2\left(1 + \kappa_{J_0}^{-2}\mu_{J_0}^2|J_0|\right)\frac{\|\lambda_{J_0^C}\|_{\ell_1}^2}{|J_0|} + C_3\left(\beta^{-1} + \kappa_{J_0}^{-2}\right)|J_0|\,\|f_0\|_\infty \frac{\log M}{n}, \qquad (12)$$

where $C_2$ is an absolute constant and $C_3$ depends on $c_1$ and $\gamma$.

The parameter $\beta$ calibrates the weights given for the bias and variance terms. Remark that
if $f_0 = f_{\lambda_0}$ and if $(LA(J_0, \kappa_{J_0}, \mu_{J_0}))$ holds with $J_0 = J_{\lambda_0}$, the support of $\lambda_0$, then under (11) the proof of Theorem 2
yields the more classical inequality

$$\|\hat f^D - f_0\|_2^2 \le C'|J_0|\,\|f_0\|_\infty \frac{\log M}{n},$$

where $C' = C_3\kappa_{J_0}^{-2}$, with at least the same probability $1 - C_1(\varepsilon,\delta,\gamma)M^{1-\frac{\gamma}{1+\varepsilon}}$.

Assumption $(LA(J_0, \kappa_{J_0}, \mu_{J_0}))$ is local, in the sense that the constants $\kappa_{J_0}$ and $\mu_{J_0}$ (or their
mere existence) may highly depend on the subset $J_0$. For a given $\lambda$, the best choice for $J_0$
in Inequalities (10) and (12) thus depends on the interaction between these constants and the
value of $\lambda$ itself. Note that the assumptions of Theorem 2 are reasonable, as the next section
gives conditions for which Assumption $(LA(J_0, \kappa_{J_0}, \mu_{J_0}))$ holds simultaneously with the same
constants $\kappa$ and $\mu$ for all subsets $J_0$ of the same size.

3.2 Results under global assumptions

As usual, when $M > n$, properties of the Dantzig estimate can be derived from assumptions on
the structure of the dictionary $\Upsilon$. For $l \in \mathbb{N}$, we denote

$$\phi_{\min}(l) = \min_{|J| \le l}\ \min_{\lambda \in \mathbb{R}^M,\ \lambda_J \ne 0} \frac{\|f_{\lambda_J}\|_2^2}{\|\lambda_J\|_{\ell_2}^2} \quad \text{and} \quad \phi_{\max}(l) = \max_{|J| \le l}\ \max_{\lambda \in \mathbb{R}^M,\ \lambda_J \ne 0} \frac{\|f_{\lambda_J}\|_2^2}{\|\lambda_J\|_{\ell_2}^2}.$$

These quantities correspond to the "restricted" eigenvalues of the Gram matrix $G$. Assuming
that $\phi_{\min}(l)$ and $\phi_{\max}(l)$ are close to 1 means that every set of columns of $G$ with cardinality
less than $l$ behaves like an orthonormal system. We also consider the restricted correlations

$$\theta_{l,l'} = \max_{\substack{|J| \le l,\ |J'| \le l' \\ J \cap J' = \emptyset}}\ \max_{\substack{\lambda, \lambda' \in \mathbb{R}^M \\ \lambda_J \ne 0,\ \lambda'_{J'} \ne 0}} \frac{\langle f_{\lambda_J}, f_{\lambda'_{J'}} \rangle}{\|\lambda_J\|_{\ell_2}\|\lambda'_{J'}\|_{\ell_2}}.$$

Small values of $\theta_{l,l'}$ mean that two disjoint sets of columns of $G$ with cardinality less than $l$ and
$l'$ span nearly orthogonal spaces. We will use one of the following assumptions considered in [3].

• Assumption 1 For some integer $1 \le s \le M/2$, we have

$$\phi_{\min}(2s) > \theta_{s,2s}. \qquad (A1(s))$$

Oracle inequalities of the Dantzig selector were established under this assumption in the
parametric linear model by Candès and Tao in [13]. It was also considered by Bickel,
Ritov and Tsybakov [3] for nonparametric regression and for the Lasso estimate. The next
assumption, proposed in [3], constitutes an alternative to Assumption 1.

• Assumption 2 For some integers $s$ and $l$ such that

$$1 \le s \le \frac{M}{2}, \quad l \ge s \quad \text{and} \quad s + l \le M, \qquad (13)$$

we have

$$l\,\phi_{\min}(s + l) > s\,\phi_{\max}(l). \qquad (A2(s,l))$$

If Assumption 2 is true for $s$ and $l$ such that $l \gg s$, then Assumption 2 means that $\phi_{\min}(l)$
cannot decrease at a rate faster than $l^{-1}$, and this condition is related to the "incoherent
designs" condition stated in [23].
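For very small dictionaries these quantities can be evaluated by enumeration. The sketch below checks Assumption 1 for $s = 1$ on a given Gram matrix, assuming the definitions of $\phi_{\min}$, $\phi_{\max}$ and $\theta_{l,l'}$ as written above (in particular with disjoint index sets); the function names are hypothetical and the approach does not scale beyond toy examples.

```python
import math
from itertools import combinations

def restricted_eigenvalues_2(G):
    """Return (phi_min(2), phi_max(2)): extreme eigenvalues over all
    principal submatrices of G of size at most 2."""
    M = len(G)
    lows = [G[m][m] for m in range(M)]       # subsets of cardinality 1
    highs = list(lows)
    for a, b in combinations(range(M), 2):   # subsets of cardinality 2
        mean = (G[a][a] + G[b][b]) / 2.0
        radius = math.hypot((G[a][a] - G[b][b]) / 2.0, G[a][b])
        lows.append(mean - radius)           # smallest eigenvalue of the 2x2 block
        highs.append(mean + radius)          # largest eigenvalue of the 2x2 block
    return min(lows), max(highs)

def theta_1_2(G):
    """Restricted correlation theta_{1,2}: for J = {m} and a disjoint J' with
    |J'| <= 2, the maximal ratio is the Euclidean norm of (G[m][a])_{a in J'}."""
    M = len(G)
    best = 0.0
    for m in range(M):
        others = [k for k in range(M) if k != m]
        for a in others:
            best = max(best, abs(G[m][a]))
        for a, b in combinations(others, 2):
            best = max(best, math.hypot(G[m][a], G[m][b]))
    return best

def assumption_1_holds_for_s1(G):
    """Check phi_min(2) > theta_{1,2}, i.e. (A1(s)) with s = 1."""
    phi_min, _ = restricted_eigenvalues_2(G)
    return phi_min > theta_1_2(G)
```

For an orthonormal dictionary, $G$ is the identity, $\phi_{\min}(2) = 1$ and $\theta_{1,2} = 0$, so the check passes trivially.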



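In the sequel, we set, under Assumption 1,

$$\kappa_1(s) = \sqrt{\phi_{\min}(2s)}\left(1 - \frac{\theta_{s,2s}}{\phi_{\min}(2s)}\right) > 0, \qquad \mu_1(s) = \frac{\theta_{s,2s}}{\sqrt{s\,\phi_{\min}(2s)}},$$

and under Assumption 2,

$$\kappa_2(s,l) = \sqrt{\phi_{\min}(s+l)}\left(1 - \sqrt{\frac{s\,\phi_{\max}(l)}{l\,\phi_{\min}(s+l)}}\right) > 0, \qquad \mu_2(s,l) = \sqrt{\frac{\phi_{\max}(l)}{l}}.$$

Now, to apply Theorem 2, we need to check $(LA(J_0, \kappa_{J_0}, \mu_{J_0}))$ for some subset $J_0$ of
$\{1, \ldots, M\}$. Either Assumption 1 or Assumption 2 implies this assumption. Indeed, we have the
following result.

Proposition 1. Let $s$ and $l$ be two integers satisfying (13). We suppose that $(A1(s))$ or $(A2(s,l))$
is true. Let $J_0 \subset \{1, \ldots, M\}$ be of size $|J_0| = s$ and $\lambda \in \mathbb{R}^M$; then we have

$$\|f_\lambda\|_2 \ge \kappa\|\lambda_{J_0}\|_{\ell_2} - \mu\left(\|\lambda_{J_0^C}\|_{\ell_1} - \|\lambda_{J_0}\|_{\ell_1}\right)_+,$$

with $\kappa = \kappa_1(s)$ and $\mu = \mu_1(s)$ under $(A1(s))$ (respectively $\kappa = \kappa_2(s,l)$ and $\mu = \mu_2(s,l)$
under $(A2(s,l))$). If $(A1(s))$ and $(A2(s,l))$ are both satisfied, $\kappa = \max(\kappa_1(s), \kappa_2(s,l))$ and $\mu =
\min(\mu_1(s), \mu_2(s,l))$.

Proposition 1 proves that Theorem 2 can be applied under Assumptions 1 or 2. In addition,
the constants $\kappa_{J_0}$ and $\mu_{J_0}$ only depend on $|J_0|$. From Theorem 2 we deduce the following result.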

Theorem 3. Let $s$ and $l$ be two integers satisfying (13). We suppose that $(A1(s))$ or $(A2(s,l))$ is
true. Then, with probability at least $1 - C_1(\varepsilon,\delta,\gamma)M^{1-\frac{\gamma}{1+\varepsilon}}$, we have for any $\beta > 0$,

$$\|\hat f^D - f_0\|_2^2 \le \inf_{\lambda \in \mathbb{R}^M}\ \inf_{\substack{J_0 \subset \{1, \ldots, M\} \\ |J_0| = s}} \left\{ \|f_\lambda - f_0\|_2^2 + \beta\,\frac{\Lambda(\lambda, J_0^C)^2}{s}\left(1 + \frac{2\mu\sqrt{s}}{\kappa}\right)^2 + 16s\left(\frac{1}{\beta} + \frac{1}{\kappa^2}\right)\|\eta_\gamma\|_{\ell_\infty}^2 \right\},$$

where

$$\Lambda(\lambda, J_0^C) = \|\lambda_{J_0^C}\|_{\ell_1} + \frac{\left(\|\hat\lambda^D\|_{\ell_1} - \|\lambda\|_{\ell_1}\right)_+}{2},$$

and $\kappa$ and $\mu$ are defined as in Proposition 1.

Remark that the best subset $J_0$ of cardinal $s$ in Theorem 3 can be easily chosen for a given
$\lambda$: it is given by the set of the $s$ largest coordinates of $\lambda$. This was not necessarily the case in
Theorem 2, for which a different subset may give a better local condition and then may provide a
smaller bound. If we further assume the mild assumption (11) on the sup-norm of the dictionary
introduced in the previous section, we deduce the following result.

Corollary 2. Let $s$ and $l$ be two integers satisfying (13). We suppose that $(A1(s))$ or $(A2(s,l))$ is
true. If (11) is satisfied, with probability at least $1 - C_1(\varepsilon,\delta,\gamma)M^{1-\frac{\gamma}{1+\varepsilon}}$, we have for any $\beta > 0$,
any $\lambda$ that satisfies the adaptive Dantzig constraint and for the best subset $J_0$ of cardinal $s$ (that
corresponds to the $s$ largest coordinates of $\lambda$ in absolute value),

$$\|\hat f^D - f_0\|_2^2 \le \|f_\lambda - f_0\|_2^2 + \beta C_2\left(1 + \kappa^{-2}\mu^2 s\right)\frac{\|\lambda_{J_0^C}\|_{\ell_1}^2}{s} + C_3\left(\beta^{-1} + \kappa^{-2}\right)s\,\|f_0\|_\infty \frac{\log M}{n}, \qquad (14)$$

where $C_2$ is an absolute constant and $C_3$ depends on $c_1$ and $\gamma$.

Note that, when $\lambda$ is $s$-sparse so that $\lambda_{J_0^C} = 0$, the oracle inequality corresponds to the
classical oracle inequality obtained in parametric frameworks (see [12] or [13] for instance) or in
nonparametric settings (see, for instance, [6], [7], [8], [9], [11] or [29]), but in these works the
functions of the dictionary are assumed to be bounded by a constant independent of $M$ and $n$.
So, the adaptive Dantzig estimate requires weaker conditions since, under (11), $\|\varphi_m\|_\infty$ can go to
$\infty$ when $n$ grows. This point is capital for practical purposes, in particular when wavelet bases
are considered.



4 Connections between the Dantzig and Lasso estimates 

We show in this section the strong connections between Lasso and Dantzig estimates, which have
already been illustrated in [3] for nonparametric regression models. By choosing convenient
random weights depending on $\eta_\gamma$ for $\ell_1$-minimization, the Lasso estimate satisfies the adaptive
Dantzig constraint. More precisely, we consider the Lasso estimator given by the solution of the
following minimization problem

$$\hat\lambda^{L,\gamma} = \arg\min_{\lambda \in \mathbb{R}^M} \left\{ R(\lambda) + 2\sum_{m=1}^M \eta_{\gamma,m}|\lambda_m| \right\}, \qquad (15)$$

where

$$R(\lambda) = \|f_\lambda\|_2^2 - \frac{2}{n}\sum_{i=1}^n f_\lambda(X_i).$$

Note that $R(\cdot)$ is the quantity minimized in unbiased estimation of the risk. To simplify notation,
we write $\hat\lambda^L = \hat\lambda^{L,\gamma}$. We denote $\hat f^L = f_{\hat\lambda^L}$. As said in the Introduction, classical Lasso estimates are
defined as the minimizer of expressions of the form

$$\left\{ R(\lambda) + 2\eta\sum_{m=1}^M |\lambda_m| \right\},$$

where $\eta$ is proportional to $\sqrt{\frac{\log M}{n}}$. So, $\hat\lambda^L$ appears as a data-driven version of classical Lasso
estimates.

The first order condition for the minimization of the expression given in (15) corresponds
exactly to the adaptive Dantzig constraint and thus Theorem 3 always applies to $\hat\lambda^L$. Working
along the lines of the proof of Theorem 3 (replace $f_\lambda$ by $\hat f^L$ and $\lambda$ by $\hat\lambda^L$ in (26) and (27)),
one can prove a slightly stronger result.

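Theorem 4. Let us assume that the assumptions of Theorem 3 are true. Let $J_0 \subset \{1, \ldots, M\}$ be of
size $|J_0| = s$. Then, with probability at least $1 - C_1(\varepsilon,\delta,\gamma)M^{1-\frac{\gamma}{1+\varepsilon}}$, we have for any $\beta > 0$,

$$\left| \|\hat f^D - f_0\|_2^2 - \|\hat f^L - f_0\|_2^2 \right| \le \beta\,\frac{\|\hat\lambda^L_{J_0^C}\|_{\ell_1}^2}{s}\left(1 + \frac{2\mu\sqrt{s}}{\kappa}\right)^2 + 16s\left(\frac{1}{\beta} + \frac{1}{\kappa^2}\right)\|\eta_\gamma\|_{\ell_\infty}^2.$$

To extend this theoretical result, numerical performances of the Dantzig and Lasso estimates
will be compared in Section 5.2.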
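For reference, the first-order condition invoked earlier in this section can be written out explicitly. The following is a standard subgradient computation, given as a sketch rather than as the argument used in the proofs; it uses $\|f_\lambda\|_2^2 = \lambda^T G \lambda$ and $\frac{1}{n}\sum_{i} f_\lambda(X_i) = \sum_m \lambda_m \hat\beta_m$.

```latex
\begin{aligned}
R(\lambda) &= \lambda^T G \lambda - 2\lambda^T \hat\beta,
\qquad \nabla R(\lambda) = 2\,(G\lambda - \hat\beta), \\
0 &\in 2\,(G\hat\lambda^{L} - \hat\beta)_m + 2\,\eta_{\gamma,m}\,\partial|\hat\lambda^{L}_m|
\qquad \text{for all } m \in \{1,\ldots,M\}, \\
(G\hat\lambda^{L} - \hat\beta)_m &= -\eta_{\gamma,m}\,\operatorname{sign}(\hat\lambda^{L}_m)
\quad \text{if } \hat\lambda^{L}_m \neq 0,
\qquad
|(G\hat\lambda^{L} - \hat\beta)_m| \le \eta_{\gamma,m}
\quad \text{if } \hat\lambda^{L}_m = 0 .
\end{aligned}
```

In both cases $|(G\hat\lambda^{L})_m - \hat\beta_m| \le \eta_{\gamma,m}$, so the Lasso solution is feasible for the adaptive Dantzig constraint, with equality on its support.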



5 Calibration and numerical experiments 
5.1 The calibration issue 

In this section, we consider the problem of calibrating previous estimates. In particular, we prove
that the sufficient condition $\gamma > 1$ is "almost" a necessary condition since we derive a special and
very simple framework in which Lasso and Dantzig estimates cannot achieve the optimal rate
if $\gamma < 1$ ("almost" means that the case $\gamma = 1$ remains an open question). Let us describe this
simple framework. The dictionary $\Upsilon$ considered in this section is the orthonormal Haar system:

$$\Upsilon = \left\{ \phi_{jk} :\ -1 \le j \le j_0,\ 0 \le k < 2^j \right\},$$

with $\phi_{-10} = 1_{[0,1]}$, $2^{j_0+1} = n$, and for $0 \le j \le j_0$, $0 \le k \le 2^j - 1$,

$$\phi_{jk} = 2^{j/2}\left( 1_{[k/2^j,\,(k+0.5)/2^j]} - 1_{[(k+0.5)/2^j,\,(k+1)/2^j]} \right).$$

In this case, $M = n$. In this setting, since functions of $\Upsilon$ are orthonormal, the Gram matrix $G$
is the identity. Thus, the Lasso and Dantzig estimates both correspond to the soft thresholding
rule:

$$\hat f^D = \hat f^L = \sum_{m=1}^M \operatorname{sign}(\hat\beta_m)\left(|\hat\beta_m| - \eta_{\gamma,m}\right) 1_{\{|\hat\beta_m| > \eta_{\gamma,m}\}}\,\varphi_m.$$
Now, our goal is to estimate $f_0 = \phi_{-10} = 1_{[0,1]}$ by using $\hat f^D$ depending on $\gamma$ and to show
the influence of this constant. Unlike previous results stated in probability, we consider the
expectation of the $L_2$-risk:

Theorem 5. On the one hand, if $\gamma > 1$, there exists a constant $C$ such that

$$\mathbb{E}\|\hat f^D - f_0\|_2^2 \le \frac{C \log n}{n}. \qquad (16)$$

On the other hand, if $\gamma < 1$, there exist a constant $c$ and $\delta < 1$ such that

$$\mathbb{E}\|\hat f^D - f_0\|_2^2 \ge \frac{c}{n^\delta}. \qquad (17)$$

This result shows that choosing $\gamma < 1$ is a bad choice in our setting. Indeed, in this case, the
Lasso and Dantzig estimates cannot estimate a very simple signal ($f_0 = 1_{[0,1]}$) at a convenient
rate of convergence.

A small simulation study is carried out to strengthen this theoretical asymptotic result.
Performing our estimation procedure 100 times, we compute the average risk $R_n(\gamma)$ for several
values of the Dantzig constant $\gamma$ and several values of $n$. This computation is summarized in
Figure 1, which displays the logarithm of $R_n(\gamma)$ for $n = 2^J$ with, from top to bottom, $J =
4, 5, 6, \ldots, 13$ on a grid of $\gamma$'s around 1. To discuss our results, we denote by $\gamma_{\min}(n)$ the best
$\gamma$: $\gamma_{\min}(n) = \arg\min_{\gamma > 0} R_n(\gamma)$. We note that $1/2 \le \gamma_{\min}(n) \le 1$ for all values of $n$, with $\gamma_{\min}(n)$
getting closer to 1 as $n$ increases. Taking $\gamma$ too small strongly deteriorates the performance while
a value close to 1 ensures a risk within a factor 2 of the optimal risk. The assumption $\gamma > 1$
giving a theoretical control on the quadratic error is thus not too conservative. Following these
results, we set $\gamma = 1.01$ in our numerical experiments in the next subsection.

Figure 1: Graphs of $\gamma \mapsto \log_2(R_n(\gamma))$ for $n = 2^J$ with, from top to bottom, $J = 4, 5, 6, \ldots, 13$

5.2 Numerical experiments

In this section, we present our numerical experiments with the Dantzig density estimator and
their results. We test our estimator with a collection of 6 dictionaries, 4 densities described
below and for 2 sample sizes. We compare our procedure with the adaptive Lasso introduced in
Section 4 and with a non-adaptive Dantzig estimator. We also consider a two-step estimation
procedure, proposed by Candès and Tao [13], which improves the numerical results.

The numerical scheme for a given dictionary $\Upsilon = (\varphi_m)_{m=1,\ldots,M}$ and a sample $(X_i)_{i=1,\ldots,n}$ is
the following.

1. Compute $\hat\beta_m$ for all $m$.

2. Compute $\hat\sigma_m^2$.

3. Compute $\eta_{\gamma,m}$ as defined in (5) by

$$\eta_{\gamma,m} = \sqrt{\frac{2\tilde\sigma_m^2 \gamma \log M}{n}} + \frac{2\|\varphi_m\|_\infty \gamma \log M}{3n}$$

with

$$\tilde\sigma_m^2 = \hat\sigma_m^2 + 2\|\varphi_m\|_\infty \sqrt{\frac{2\hat\sigma_m^2 \gamma \log M}{n}} + \frac{8\|\varphi_m\|_\infty^2 \gamma \log M}{n}$$

and $\gamma = 1.01$.

4. Compute the coefficients $\hat\lambda^{D,\gamma}$ of the Dantzig estimate, $\hat\lambda^{D,\gamma} = \arg\min_{\lambda \in \mathbb{R}^M} \|\lambda\|_{\ell_1}$ such that
$\lambda$ satisfies the Dantzig constraint (4)

$$\forall m \in \{1, \ldots, M\}, \quad |(G\lambda)_m - \hat\beta_m| \le \eta_{\gamma,m},$$

with the homotopy-path-following method proposed by Asif and Romberg [2].

5. Compute the Dantzig estimate $\hat f^{D,\gamma} = \sum_{m=1}^M \hat\lambda_m^{D,\gamma}\varphi_m$.

Note that we have implicitly assumed that the Gram matrix $G$ used in the definition of the
Dantzig constraint has been precomputed.
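Step 4 is a linear program, so a generic LP solver could in principle be used in place of the homotopy method. A standard reformulation, sketched here with auxiliary variables $u_m$ that are not part of the paper's notation, reads:

```latex
\begin{aligned}
\min_{\lambda,\, u \in \mathbb{R}^M} \quad & \sum_{m=1}^M u_m \\
\text{subject to} \quad & -u_m \le \lambda_m \le u_m, \\
& \hat\beta_m - \eta_{\gamma,m} \le (G\lambda)_m \le \hat\beta_m + \eta_{\gamma,m},
\qquad m = 1, \ldots, M .
\end{aligned}
```

At the optimum $u_m = |\lambda_m|$, so the objective equals $\|\lambda\|_{\ell_1}$ and the second family of inequalities is exactly the adaptive Dantzig constraint.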

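For the Lasso estimator, the Dantzig minimization of step 4 is replaced by the Lasso
minimization (15)

$$\hat\lambda^{L,\gamma} = \arg\min_{\lambda \in \mathbb{R}^M} \left\{ R(\lambda) + 2\sum_{m=1}^M \eta_{\gamma,m}|\lambda_m| \right\},$$

which is solved using the LARS algorithm. The non-adaptive Dantzig estimate is obtained by
replacing $\tilde\sigma_m^2$ in step 3 by $\|f_0\|_\infty$. The two-step procedure of Candès and Tao adds a least-squares
step between step 4 and step 5. More precisely, let $\hat J^{D,\gamma}$ be the support of the estimate $\hat\lambda^{D,\gamma}$.
This defines a subset of the dictionary on which the density is regressed:

$$\left(\hat\lambda^{D+LS,\gamma}\right)_{\hat J^{D,\gamma}} = G_{\hat J^{D,\gamma}}^{-1}\left(\hat\beta_m\right)_{m \in \hat J^{D,\gamma}},$$

where $G_{\hat J^{D,\gamma}}$ is the submatrix of $G$ corresponding to the subset chosen. The values of $\hat\lambda^{D+LS,\gamma}$
outside $\hat J^{D,\gamma}$ are set to 0 and $\hat f^{D+LS,\gamma}$ is defined accordingly.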
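The least-squares step can be sketched as follows, assuming that it amounts to solving the normal equations $G_J \lambda_J = \hat\beta_J$ on the support $J$ of the Dantzig solution (the minimizer of $R(\lambda)$ over vectors supported on $J$). The function names and the support tolerance `tol` are hypothetical choices for this illustration.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    aug = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular submatrix of the Gram matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def least_squares_refit(G, beta_hat, lam_dantzig, tol=1e-10):
    """Refit the coefficients on the support of the Dantzig solution.

    Coefficients outside the support stay at 0."""
    support = [m for m, v in enumerate(lam_dantzig) if abs(v) > tol]
    refit = [0.0] * len(lam_dantzig)
    if support:
        G_sub = [[G[a][b] for b in support] for a in support]
        beta_sub = [beta_hat[m] for m in support]
        for m, value in zip(support, solve_linear_system(G_sub, beta_sub)):
            refit[m] = value
    return refit
```

For an orthonormal dictionary the refit simply replaces each selected coefficient by $\hat\beta_m$, which removes the shrinkage introduced by the threshold on the selected coordinates.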

We now describe the dictionaries we consider. We focus numerically on densities defined on
the interval $[0, 1]$, so we use dictionaries adapted to this setting. The first four are orthonormal
systems, which are used as a benchmark, while the last two are "real" dictionaries. More precisely,
our dictionaries are

• the Fourier basis with $M = n + 1$ elements (denoted "Fou"),

• the histogram collection with the classical number $\sqrt{n}/2 < M = 2^{j_0} < \sqrt{n}$ of bins (denoted
"Hist"),

• the Haar wavelet basis with maximal resolution $n/2 < M = 2^{j_1} < n$ and thus $M = 2^{j_1}$
elements (denoted "Haar"),

• the more regular Daubechies 6 wavelet basis with maximal resolution $n/2 < M = 2^{j_1} < n$
and thus $M = 2^{j_1}$ elements (denoted "Wav"),

• the dictionary made of the union of the Fourier basis and the histogram collection and thus
comprising $M = n + 1 + 2^{j_0}$ elements (denoted "Mix"),

• the dictionary which is the union of the Fourier basis, the histogram collection and the
Haar wavelets of resolution greater than $2^{j_0}$, comprising $M = n + 1 + 2^{j_1}$ elements (denoted
"Mix2").

The orthonormal families we have chosen are often used by practitioners. Our dictionaries 
combine very different orthonormal families, sine and cosine with bins or Haar wavelets, which 
ensures a sufficiently incoherent design. 

We test the estimators of the following 4 functions shown in Figure 2 (with their Dantzig and
Dantzig+Least Square estimates with the "Mix2" dictionary):

• a very spiky density

$$f_1(t) = .47 \times \left(4t \times 1_{t \le .5} + 4(1 - t) \times 1_{t > .5}\right) + .53 \times \left(75 \times 1_{\ldots}\right),$$

• a mix of Gaussian and Laplacian type densities

$$f_2(t) = .45 \times \left(\frac{e^{-(t-.45)^2/(2(.125)^2)}}{\ldots}\right) + .55 \times \left(\frac{e^{-20|t-.67|}}{\ldots}\right),$$

• a mix of uniform densities on subintervals

$$f_3(t) = .25 \times \left(\frac{1}{.14}\,1_{.33 \le t \le .47}\right) + .75 \times \left(\frac{1}{.16}\,1_{.64 \le t \le .80}\right),$$

• a mix of a density easily described in the Fourier domain and a uniform density on a
subinterval

$$f_4(t) = .45 \times \left(1 + .9\cos(2\pi t)\right) + .55 \times \left(\ldots\,1_{\ldots \le t \le .80}\right).$$

Boxplots of Figures 3 and 4 summarize our numerical experiments for n = 500 and n = 2000
and 100 repetitions of the procedures. The left column deals with the comparison between
Dantzig and Lasso, the center column shows the effectiveness of our data-driven constraint and
the right column illustrates the improvement of the two-step method. As expected, Dantzig
and Lasso estimators are strictly equivalent when the dictionary is orthonormal and very close
otherwise. For both algorithms and most of the densities, the best solution appears to be the
"Mix2" dictionary, except for the density $f_1$ where the Haar wavelets are better for n = 500.
This shows that the dictionary approach yields an improvement over the classical basis approach.
One also observes that the "Mix" dictionary is better than the best of its constituents, namely the
Fourier basis and the histogram family, which corroborates our theoretical results. The adaptive
constraints are much tighter than their non-adaptive counterparts and yield much better
numerical results. Our last series of experiments shows the significant improvement obtained
with the least square step. As hinted by Candès and Tao [13], this can be explained by the
bias common to $\ell_1$ methods, which is partially removed by this final least square adjustment.
Studying directly the performance of this estimator is a challenging task.
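The equivalence of the Dantzig and Lasso estimates for an orthonormal dictionary can be illustrated coordinate by coordinate. The sketch below assumes that, in the orthonormal case, the Dantzig constraint reads $|\lambda_m - \hat\beta_m| \le \eta_m$ and that the Lasso criterion separates into $\lambda_m^2 - 2\lambda_m\hat\beta_m + 2\eta_m|\lambda_m|$; both problems are then solved by soft thresholding. The coefficient and threshold values are made up for the example, and the grid searches are only a brute-force check, not the algorithms used in the experiments.

```python
import numpy as np

def soft_threshold(beta_hat, eta):
    """Closed-form solution shared by both programs in the orthonormal case."""
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - eta, 0.0)

def dantzig_coordinate_by_grid(b, eta, grid):
    """Smallest |x| on the grid subject to |x - b| <= eta."""
    feasible = grid[np.abs(grid - b) <= eta + 1e-12]
    return feasible[np.argmin(np.abs(feasible))]

def lasso_coordinate_by_grid(b, eta, grid):
    """Grid minimizer of x^2 - 2 b x + 2 eta |x|."""
    values = grid**2 - 2 * b * grid + 2 * eta * np.abs(grid)
    return grid[np.argmin(values)]

def least_squares_refit(beta_hat, selected):
    """Two-step variant: with an orthonormal dictionary the least square
    fit on the selected support is the empirical coefficient itself."""
    return np.where(selected, beta_hat, 0.0)

# Hypothetical empirical coefficients and thresholds.
beta_hat = np.array([0.9, -0.05, 0.3, -0.6, 0.02])
eta = np.array([0.1, 0.1, 0.1, 0.2, 0.05])
grid = np.linspace(-2.0, 2.0, 4001)

dantzig = soft_threshold(beta_hat, eta)
dantzig_grid = np.array([dantzig_coordinate_by_grid(b, e, grid) for b, e in zip(beta_hat, eta)])
lasso_grid = np.array([lasso_coordinate_by_grid(b, e, grid) for b, e in zip(beta_hat, eta)])
refit = least_squares_refit(beta_hat, dantzig != 0)
```

The refitted coefficients are larger in absolute value than the thresholded ones on the selected support, which is the bias removal mentioned for the least square step.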

6 Proofs 

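6.1 Proof of Theorem 1

To prove the first part of Theorem 1, we fix $m \in \{1, \dots, M\}$ and we set for any $i \in \{1, \dots, n\}$,

$$W_i = \frac{1}{n}\left(\varphi_m(X_i) - \beta_{0,m}\right),$$

which satisfies almost surely

$$|W_i| \le \frac{2\|\varphi_m\|_\infty}{n}.$$

Then, we apply Bernstein's Inequality (see [21] on pages 24 and 26) with the variables $W_i$ and
$-W_i$: for any $u > 0$,

$$P\left(|\hat\beta_m - \beta_{0,m}| \ge \sqrt{\frac{2\sigma_{0,m}^2 u}{n}} + \frac{2u\|\varphi_m\|_\infty}{3n}\right) \le 2e^{-u}. \qquad (18)$$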
Figure 2: The different densities and their "Mix2" estimates. Densities are plotted in blue while
their estimates are plotted in black. The full line corresponds to the adaptive Dantzig studied
in this paper while the dotted line corresponds to its least square variant.

Figure 3: Boxplots for n = 500. Left column ("Dantzig/Lasso"): Dantzig and Lasso estimates.
Center column ("Dantzig/Non adapt. Dantzig"): Dantzig estimates associated with adaptive and
non-adaptive constraints. Right column ("Dantzig/Dantzig+LS"): Our estimate and the two-step estimate.

Figure 4: Boxplots for n = 2000. Left column ("Dantzig/Lasso"): Dantzig and Lasso estimates.
Center column ("Dantzig/Non adapt. Dantzig"): Dantzig estimates associated with adaptive and
non-adaptive constraints. Right column ("Dantzig/Dantzig+LS"): Our estimate and the two-step estimate.
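Now, let us decompose $\hat\sigma_{0,m}^2$ in two terms:

$$\hat\sigma_{0,m}^2 = \frac{1}{2n(n-1)}\sum_{i \ne j}\left(\varphi_m(X_i) - \varphi_m(X_j)\right)^2$$
$$= \frac{1}{2n}\sum_{i=1}^n\left(\varphi_m(X_i) - \beta_{0,m}\right)^2 + \frac{1}{2n}\sum_{j=1}^n\left(\varphi_m(X_j) - \beta_{0,m}\right)^2 - \frac{2}{n(n-1)}\sum_{i=2}^n\sum_{j=1}^{i-1}\left(\varphi_m(X_i) - \beta_{0,m}\right)\left(\varphi_m(X_j) - \beta_{0,m}\right)$$
$$= s_n - \frac{2}{n(n-1)}\,u_n,$$

with

$$s_n = \frac{1}{n}\sum_{i=1}^n\left(\varphi_m(X_i) - \beta_{0,m}\right)^2 \quad\text{and}\quad u_n = \sum_{i=2}^n\sum_{j=1}^{i-1}\left(\varphi_m(X_i) - \beta_{0,m}\right)\left(\varphi_m(X_j) - \beta_{0,m}\right). \qquad (19)$$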
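The decomposition of the empirical variance into the terms $s_n$ and $u_n$ of (19) is a purely algebraic identity, so it can be checked numerically on any sample. The sketch below does this for one dictionary element; the choice $\varphi_m(x) = \sqrt{2}\cos(2\pi x)$ with uniform data (so that $\beta_{0,m} = 0$) is an assumption made for the example.

```python
import numpy as np

def sigma_hat_sq(phi_vals):
    """Empirical variance written as a U-statistic over all pairs i != j."""
    n = len(phi_vals)
    diffs = phi_vals[:, None] - phi_vals[None, :]
    return (diffs**2).sum() / (2 * n * (n - 1))

def decomposition(phi_vals, beta0):
    """Return (s_n, u_n) as defined in (19)."""
    centered = phi_vals - beta0
    s_n = np.mean(centered**2)
    # sum_{i > j} a_i a_j = ((sum_i a_i)^2 - sum_i a_i^2) / 2
    u_n = (centered.sum()**2 - (centered**2).sum()) / 2
    return s_n, u_n

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(size=n)
phi_vals = np.sqrt(2) * np.cos(2 * np.pi * x)   # illustrative phi_m, beta_{0,m} = 0

s_n, u_n = decomposition(phi_vals, beta0=0.0)
lhs = sigma_hat_sq(phi_vals)
rhs = s_n - 2 * u_n / (n * (n - 1))
print(abs(lhs - rhs))
```

The same quantity coincides with the usual unbiased sample variance, which gives a cheaper way to compute it in practice.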

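Let us first focus on $s_n$, which is the main term of $\hat\sigma_{0,m}^2$, by applying again Bernstein's Inequality
with

$$Y_i = \frac{\sigma_{0,m}^2 - \left(\varphi_m(X_i) - \beta_{0,m}\right)^2}{n},$$

which satisfies

$$Y_i \le \frac{\sigma_{0,m}^2}{n}.$$

One has that for any $u > 0$,

$$P\left(\sigma_{0,m}^2 \ge s_n + \sqrt{2 v_m u} + \frac{\sigma_{0,m}^2 u}{3n}\right) \le e^{-u},$$

with

$$v_m = \frac{1}{n}\,E\left(\left[\sigma_{0,m}^2 - \left(\varphi_m(X_i) - \beta_{0,m}\right)^2\right]^2\right).$$

But we have

$$v_m = \frac{1}{n}\left(E\left[\left(\varphi_m(X_i) - \beta_{0,m}\right)^4\right] - \sigma_{0,m}^4\right) \le \frac{\sigma_{0,m}^2}{n}\left(\|\varphi_m\|_\infty + |\beta_{0,m}|\right)^2 \le \frac{4\sigma_{0,m}^2}{n}\|\varphi_m\|_\infty^2.$$

Finally, with for any $u > 0$,

$$S(u) = 2\sqrt{2}\,\sigma_{0,m}\|\varphi_m\|_\infty\sqrt{\frac{u}{n}} + \frac{\sigma_{0,m}^2 u}{3n},$$

we have

$$P\left(\sigma_{0,m}^2 \ge s_n + S(u)\right) \le e^{-u}. \qquad (20)$$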
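The term $u_n$ is a degenerate U-statistic that satisfies for any $u > 0$,

$$P\left(|u_n| \ge U(u)\right) \le 6e^{-u}, \qquad (21)$$

with for any $u > 0$,

$$U(u) = \frac{4}{3}Au^2 + \left(4\sqrt{2} + \frac{2}{3}\right)Bu^{3/2} + \left(2D + \frac{2}{3}F\right)u + 2\sqrt{2}\,C\sqrt{u},$$

where $A$, $B$, $C$, $D$ and $F$ are constants not depending on $u$ that satisfy

$$A \le 4\|\varphi_m\|_\infty^2, \quad B \le 2\sqrt{n-1}\,\|\varphi_m\|_\infty^2, \quad C \le \sqrt{\frac{n(n-1)}{2}}\,\sigma_{0,m}^2, \quad D \le \sqrt{\frac{n(n-1)}{2}}\,\sigma_{0,m}^2,$$

and

$$F \le 2\sqrt{2}\,\|\varphi_m\|_\infty^2\sqrt{(n-1)\log(2n)}$$

(see [27]). Then, we have for any $u > 0$,

$$\frac{2}{n(n-1)}U(u) \le \frac{32}{3n(n-1)}\|\varphi_m\|_\infty^2 u^2 + \left(16\sqrt{2} + \frac{8}{3}\right)\frac{\|\varphi_m\|_\infty^2 u^{3/2}}{n\sqrt{n-1}} + \frac{2\sqrt{2}\,\sigma_{0,m}^2 u}{\sqrt{n(n-1)}} + \frac{8\sqrt{2}}{3}\,\frac{\|\varphi_m\|_\infty^2 u\sqrt{\log(2n)}}{n\sqrt{n-1}} + \frac{4\sigma_{0,m}^2\sqrt{u}}{\sqrt{n(n-1)}}.$$

Now, we take $u$ that satisfies

$$u = o(n) \qquad (22)$$

and

$$\sqrt{\log(2n)} \le \sqrt{2u}. \qquad (23)$$

Therefore, for any $\varepsilon_1 > 0$, we have for $n$ large enough,

$$\frac{2}{n(n-1)}U(u) \le \varepsilon_1\sigma_{0,m}^2 + \left(16\sqrt{2} + 8\right)\frac{\|\varphi_m\|_\infty^2 u^{3/2}}{n\sqrt{n-1}} + \frac{32}{3}\,\frac{\|\varphi_m\|_\infty^2 u^2}{n(n-1)}.$$

So, for $n$ large enough,

$$\frac{2}{n(n-1)}U(u) \le \varepsilon_1\sigma_{0,m}^2 + C_1\|\varphi_m\|_\infty^2\left(\frac{u}{n}\right)^{3/2}, \qquad (24)$$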

where $C_1 = 16\sqrt{2} + 19$. Using Inequalities (20) and (21), we obtain

$$P\left(\sigma_{0,m}^2 \ge \hat\sigma_{0,m}^2 + S(u) + \frac{2}{n(n-1)}U(u)\right) = P\left(\sigma_{0,m}^2 \ge s_n - \frac{2}{n(n-1)}u_n + S(u) + \frac{2}{n(n-1)}U(u)\right)$$
$$\le P\left(\sigma_{0,m}^2 \ge s_n + S(u)\right) + P\left(u_n \ge U(u)\right)$$
$$\le 7e^{-u}.$$
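Now, using (24), for any $0 < \varepsilon_2 < 1$, we have for $n$ large enough,

$$\hat\sigma_{0,m}^2 + S(u) + \frac{2}{n(n-1)}U(u) = \hat\sigma_{0,m}^2 + 2\sqrt{2}\,\sigma_{0,m}\|\varphi_m\|_\infty\sqrt{\frac{u}{n}} + \frac{\sigma_{0,m}^2 u}{3n} + \frac{2}{n(n-1)}U(u)$$
$$\le \hat\sigma_{0,m}^2 + 2\sqrt{2}\,\sigma_{0,m}\|\varphi_m\|_\infty\sqrt{\frac{u}{n}} + \frac{\sigma_{0,m}^2 u}{3n} + \varepsilon_1\sigma_{0,m}^2 + C_1\|\varphi_m\|_\infty^2\left(\frac{u}{n}\right)^{3/2}$$
$$\le \hat\sigma_{0,m}^2 + 2\sqrt{2}\,\sigma_{0,m}\|\varphi_m\|_\infty\sqrt{\frac{u}{n}} + \varepsilon_2\sigma_{0,m}^2 + C_1\|\varphi_m\|_\infty^2\left(\frac{u}{n}\right)^{3/2}.$$

Therefore,

$$P\left((1 - \varepsilon_2)\sigma_{0,m}^2 \ge \hat\sigma_{0,m}^2 + 2\sqrt{2}\,\sigma_{0,m}\|\varphi_m\|_\infty\sqrt{\frac{u}{n}} + C_1\|\varphi_m\|_\infty^2\left(\frac{u}{n}\right)^{3/2}\right) \le 7e^{-u}. \qquad (25)$$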

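Now, let us set

$$a = 1 - \varepsilon_2, \quad b = \sqrt{2}\,\|\varphi_m\|_\infty\sqrt{\frac{u}{n}}, \quad c = \hat\sigma_{0,m}^2 + C_1\|\varphi_m\|_\infty^2\left(\frac{u}{n}\right)^{3/2},$$

and consider the polynomial

$$P(x) = ax^2 - 2bx - c,$$

with roots $\frac{b \pm \sqrt{b^2 + ac}}{a}$. So, we have

$$P(\sigma_{0,m}) \ge 0 \iff \sigma_{0,m} \ge \frac{b + \sqrt{b^2 + ac}}{a} \iff \sigma_{0,m}^2 \ge \frac{c}{a} + \frac{2b^2}{a^2} + \frac{2b\sqrt{b^2 + ac}}{a^2}.$$

It yields

$$P\left(\sigma_{0,m}^2 \ge \frac{c}{a} + \frac{2b^2}{a^2} + \frac{2b\sqrt{b^2 + ac}}{a^2}\right) \le 7e^{-u},$$

so

$$P\left(\sigma_{0,m}^2 \ge \frac{c}{a} + \frac{4b^2}{a^2} + \frac{2b\sqrt{ac}}{a^2}\right) \le 7e^{-u},$$

which means that for any $0 < \varepsilon_3 < 1$, we have for $n$ large enough,

$$P\left(\sigma_{0,m}^2 \ge (1 + \varepsilon_3)\left(\hat\sigma_{0,m}^2 + C_1\|\varphi_m\|_\infty^2\left(\frac{u}{n}\right)^{3/2} + 8\|\varphi_m\|_\infty^2\frac{u}{n} + 2\sqrt{2}\,\|\varphi_m\|_\infty\sqrt{\frac{u}{n}}\sqrt{\hat\sigma_{0,m}^2 + C_1\|\varphi_m\|_\infty^2\left(\frac{u}{n}\right)^{3/2}}\right)\right) \le 7e^{-u}.$$

Finally, we can claim that for any $0 < \varepsilon_4 < 1$, we have for $n$ large enough,

$$P\left(\sigma_{0,m}^2 \ge (1 + \varepsilon_4)\left(\hat\sigma_{0,m}^2 + 8\|\varphi_m\|_\infty^2\frac{u}{n} + 2\|\varphi_m\|_\infty\sqrt{\frac{2\hat\sigma_{0,m}^2 u}{n}}\right)\right) \le 7e^{-u}.$$

Now, we take $u = \gamma\log M$. Under the assumptions of Theorem 1, Conditions (22) and (23) are
satisfied. The previous concentration inequality means that

$$P\left(\sigma_{0,m}^2 \ge (1 + \varepsilon_4)\tilde\sigma_{0,m}^2\right) \le 7M^{-\gamma}.$$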
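Now, using (18), we have for $n$ large enough,

$$P\left(|\beta_{0,m} - \hat\beta_m| \ge \eta_{\gamma,m}\right) = P\left(|\beta_{0,m} - \hat\beta_m| \ge \sqrt{\frac{2\tilde\sigma_{0,m}^2\gamma\log M}{n}} + \frac{2\|\varphi_m\|_\infty\gamma\log M}{3n},\ \sigma_{0,m}^2 < (1 + \varepsilon_4)\tilde\sigma_{0,m}^2\right)$$
$$+ P\left(|\beta_{0,m} - \hat\beta_m| \ge \eta_{\gamma,m},\ \sigma_{0,m}^2 \ge (1 + \varepsilon_4)\tilde\sigma_{0,m}^2\right)$$
$$\le P\left(|\beta_{0,m} - \hat\beta_m| \ge \sqrt{\frac{2\sigma_{0,m}^2\gamma(1 + \varepsilon_4)^{-1}\log M}{n}} + \frac{2\|\varphi_m\|_\infty\gamma(1 + \varepsilon_4)^{-1}\log M}{3n}\right) + P\left(\sigma_{0,m}^2 \ge (1 + \varepsilon_4)\tilde\sigma_{0,m}^2\right)$$
$$\le 2M^{-\gamma(1 + \varepsilon_4)^{-1}} + 7M^{-\gamma}.$$

Then, the first part of Theorem 1 is proved: for any $\varepsilon > 0$,

$$P\left(|\beta_{0,m} - \hat\beta_m| \ge \eta_{\gamma,m}\right) \le C(\varepsilon, \delta, \gamma)M^{-\frac{\gamma}{1+\varepsilon}},$$

where $C(\varepsilon, \delta, \gamma)$ is a constant that depends on $\varepsilon$, $\delta$ and $\gamma$.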
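To make the role of the data-driven threshold concrete, here is a minimal sketch that computes $\eta_{\gamma,m}$ from a sample and counts how often $|\hat\beta_m - \beta_{0,m}|$ exceeds it. It assumes the forms $\tilde\sigma_{0,m}^2 = \hat\sigma_{0,m}^2 + 2\|\varphi_m\|_\infty\sqrt{2\hat\sigma_{0,m}^2\gamma\log M/n} + 8\|\varphi_m\|_\infty^2\gamma\log M/n$ and $\eta_{\gamma,m} = \sqrt{2\tilde\sigma_{0,m}^2\gamma\log M/n} + 2\|\varphi_m\|_\infty\gamma\log M/(3n)$ used in the proof. The dictionary element, the uniform sampling density, and the values of $n$, $M$ and $\gamma$ are illustrative assumptions.

```python
import numpy as np

def adaptive_threshold(phi_vals, sup_norm, gamma, M):
    """Data-driven threshold eta_{gamma,m} built from the empirical variance."""
    n = len(phi_vals)
    u = gamma * np.log(M)
    sigma_hat_sq = np.var(phi_vals, ddof=1)          # unbiased empirical variance
    sigma_tilde_sq = (sigma_hat_sq
                      + 2 * sup_norm * np.sqrt(2 * sigma_hat_sq * u / n)
                      + 8 * sup_norm**2 * u / n)
    return np.sqrt(2 * sigma_tilde_sq * u / n) + 2 * sup_norm * u / (3 * n)

rng = np.random.default_rng(0)
n, M, gamma = 500, 501, 1.01                          # assumed values
sup_norm = np.sqrt(2)                                 # sup-norm of sqrt(2) cos(2 pi x)
beta0 = 0.0                                           # true coefficient under the uniform density

n_rep = 200
exceed = 0
for _ in range(n_rep):
    x = rng.uniform(size=n)
    phi_vals = np.sqrt(2) * np.cos(2 * np.pi * x)
    beta_hat = phi_vals.mean()
    eta = adaptive_threshold(phi_vals, sup_norm, gamma, M)
    exceed += abs(beta_hat - beta0) > eta

exceed_rate = exceed / n_rep
eta_example = eta
print(exceed_rate, eta_example)
```

In this toy setting the threshold is several standard deviations of $\hat\beta_m$ wide, so exceedances are very rare, in line with the polynomially small bound of the theorem.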

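For the second part of the result, we apply again Bernstein's Inequality with

$$Z_i = \frac{\left(\varphi_m(X_i) - \beta_{0,m}\right)^2 - \sigma_{0,m}^2}{n},$$

which satisfies

$$Z_i \le \frac{\left(\|\varphi_m\|_\infty + |\beta_{0,m}|\right)^2}{n} \le \frac{4\|\varphi_m\|_\infty^2}{n}.$$

One has that for any $u > 0$,

$$P\left(s_n \ge \sigma_{0,m}^2 + \sqrt{2 v_m u} + \frac{4\|\varphi_m\|_\infty^2 u}{3n}\right) \le e^{-u}$$

with $v_m = \frac{1}{n}\,E\left(\left[\sigma_{0,m}^2 - \left(\varphi_m(X_i) - \beta_{0,m}\right)^2\right]^2\right)$.

So, for any $u > 0$,

$$P\left(s_n \ge \sigma_{0,m}^2 + 2\sqrt{2}\,\sigma_{0,m}\|\varphi_m\|_\infty\sqrt{\frac{u}{n}} + \frac{4\|\varphi_m\|_\infty^2 u}{3n}\right) \le e^{-u}.$$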

Now, for any £5 > 0, for any u > 0, 

Using ([21]), with 

n \3 S5J 

d-l>{l + £5)^0.™ + S{u) + -r^--TU{u)] = P ( S„ - — ^— tW„ > (1 + £5)fTo,m + -SCii) + 



n(n— 1) y \ n(n— 1) n{n 

< P (s„ > (1 + £5)CTo%„ + ^(m)) + P i-Un > U{u)) 
Using (24),



Since 



> (1 + £l + £5)4™ + ^(") + ^llbmllL (^) 



< 7e-". 



_^^2a^7logAf , 2 II ^.^ loo 7 log M 



n 3n 

with 

~2 ^2^011 „ /2a^7logM , 8||</7„||^7logM 
we have for any eg > 0, 

< (l + .a) + r4|l.™llL(7logM)^ 



77,™ ^ - J^y^^c, J y — 



<(l+ea)r^^) U,U2||,,„" ./2a^.7logM , 8||^J|L7logM 



\ n 



-^m II OC' 



+ 1(1 +£-1) /^IIVm||oo7l0gM 



9^ 



,o n / 27 log A/ 

< (1 + £6)'ct' ' 



||</?m||oo7logMV , 4(l + e^i) /||</7™|U7logM 



+ 16(1 + £6) 

Finally, with u = jlogM, with probability larger than 1 — 7M~'^, 
(7^ < (1 + £1 + £5)^2 „ + 5(7 log M) + Ci 



rnlloo ^ 



/27logM\ 2 /^7logAfy 2 4 



and 

^7,™ < (l+^6)'(l+£5+£lX™ (^^^^^^j +(l+£6) J ll^mllooyg ■ g 

+ 2Ci(l + £6)2||^™||L(^^^j + (^^^^j (^4£6-l(l + £6) + 16(l + £6) + 

Finally, with £5 = 1, £1 = £5 = |, for n large enough. 



Note that $\sqrt{32/3 + 32 + 8 + 32 + 8/9} \approx 9.1409$.

For the last part, starting from l(25|) with u = 7 log A/ and 62 — ^^^e have for n large enough 
and with probability larger than 1 — 1M~'^ , 



62^-2,0/^ II II R^ogM 2 /7logAfY 

yO-Q m < Cr„ +2V2cro,m||<Pm||ooY + 1 II || oo ( I 

^■-2^2 2 ||2 TlogM ||2 /^7logM\^ 

< Cr„ + y(To,m + 7||V3m||oo + ClWni\oo ( I 
So, for $n$ large enough,



and 



"^^2 ^ , n|, ||2 7l0gM 2 



?77,Tn > Co 



6.2 Proof of Theorem 2



87logAf 2 II loo 7 log M 



In 



3n 



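Let $\lambda = (\lambda_m)_{m=1,\dots,M}$ and set $\Delta = \lambda - \hat\lambda^D$. We have

$$\|f_\lambda - f_0\|_2^2 = \|\hat f^D - f_0\|_2^2 + \|f_\lambda - \hat f^D\|_2^2 + 2\int\left(\hat f^D(x) - f_0(x)\right)\left(f_\lambda(x) - \hat f^D(x)\right)dx. \qquad (26)$$

We have $\|f_\lambda - \hat f^D\|_2^2 = \|f_\Delta\|_2^2$. Moreover, with probability at least $1 - C_1(\varepsilon, \delta, \gamma)M^{1 - \frac{\gamma}{1+\varepsilon}}$, we have

$$\left|\int\left(\hat f^D(x) - f_0(x)\right)\left(f_\lambda(x) - \hat f^D(x)\right)dx\right| = \left|\sum_{m=1}^M\left(\lambda_m - \hat\lambda_m^D\right)\left[(G\hat\lambda^D)_m - \beta_{0,m}\right]\right|$$
$$\le \|\Delta\|_{\ell_1}\, 2\|\eta_\gamma\|_{\ell_\infty}, \qquad (27)$$

where the last line is a consequence of the definition of the Dantzig estimator and of Theorem
1. Then, we have

$$\|\hat f^D - f_0\|_2^2 \le \|f_\lambda - f_0\|_2^2 + 4\|\eta_\gamma\|_{\ell_\infty}\|\Delta\|_{\ell_1} - \|f_\Delta\|_2^2.$$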

We use then the following Lemma: 

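Lemma 1. Let $J \subset \{1, \dots, M\}$. For any $\lambda \in \mathbb{R}^M$,

$$\|\Delta_{J^c}\|_{\ell_1} \le \|\Delta_J\|_{\ell_1} + 2\|\lambda_{J^c}\|_{\ell_1} + \left(\|\hat\lambda^D\|_{\ell_1} - \|\lambda\|_{\ell_1}\right)_+,$$

where $\Delta = \hat\lambda^D - \lambda$.

Proof of Lemma 1. This lemma is based on the fact that

$$\|\hat\lambda^D\|_{\ell_1} \le \|\lambda\|_{\ell_1} + \left(\|\hat\lambda^D\|_{\ell_1} - \|\lambda\|_{\ell_1}\right)_+,$$

which implies that

$$\|\Delta_J + \lambda_J\|_{\ell_1} + \|\Delta_{J^c} + \lambda_{J^c}\|_{\ell_1} \le \|\lambda_J\|_{\ell_1} + \|\lambda_{J^c}\|_{\ell_1} + \left(\|\hat\lambda^D\|_{\ell_1} - \|\lambda\|_{\ell_1}\right)_+$$

and thus

$$\|\lambda_J\|_{\ell_1} - \|\Delta_J\|_{\ell_1} + \|\Delta_{J^c}\|_{\ell_1} - \|\lambda_{J^c}\|_{\ell_1} \le \|\lambda_J\|_{\ell_1} + \|\lambda_{J^c}\|_{\ell_1} + \left(\|\hat\lambda^D\|_{\ell_1} - \|\lambda\|_{\ell_1}\right)_+.$$

Note that if $\lambda$ satisfies the Dantzig condition then by definition of $\hat\lambda^D$: $\left(\|\hat\lambda^D\|_{\ell_1} - \|\lambda\|_{\ell_1}\right)_+ = 0$.
Using the previous lemma, we have:

$$\left(\|\Delta_{J_0^c}\|_{\ell_1} - \|\Delta_{J_0}\|_{\ell_1}\right)_+ \le 2\|\lambda_{J_0^c}\|_{\ell_1} + \left(\|\hat\lambda^D\|_{\ell_1} - \|\lambda\|_{\ell_1}\right)_+.$$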
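Since Lemma 1 only relies on the triangle inequality, it holds for any pair of vectors, whether or not one of them is a Dantzig solution. The following sketch checks this on random draws; the vectors `lam_hat` and `lam` are arbitrary stand-ins for $\hat\lambda^D$ and $\lambda$, and the dimension and support size are chosen for illustration only.

```python
import numpy as np

def l1(v):
    return np.abs(v).sum()

def lemma1_gap(lam_hat, lam, J):
    """Right-hand side minus left-hand side of Lemma 1 (should be >= 0)."""
    in_J = np.zeros(lam.size, dtype=bool)
    in_J[J] = True
    delta = lam_hat - lam
    lhs = l1(delta[~in_J])
    rhs = l1(delta[in_J]) + 2 * l1(lam[~in_J]) + max(l1(lam_hat) - l1(lam), 0.0)
    return rhs - lhs

rng = np.random.default_rng(0)
M, size_J = 20, 5
gaps = []
for _ in range(1000):
    lam_hat = rng.normal(size=M)          # stand-in for the Dantzig estimate
    lam = rng.normal(size=M)              # arbitrary comparison vector
    J = rng.choice(M, size=size_J, replace=False)
    gaps.append(lemma1_gap(lam_hat, lam, J))

min_gap = min(gaps)
print(min_gap)
```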
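Using now

$$\Lambda(\lambda, J_0^c) = \|\lambda_{J_0^c}\|_{\ell_1} + \frac{\left(\|\hat\lambda^D\|_{\ell_1} - \|\lambda\|_{\ell_1}\right)_+}{2},$$

so that $\Lambda(\lambda, J_0^c) = \|\lambda_{J_0^c}\|_{\ell_1}$ as soon as $\lambda$ satisfies
the Dantzig condition, we obtain

$$\|f_\Delta\|_2 \ge \kappa_{J_0}\|\Delta_{J_0}\|_{\ell_2} - \mu_{J_0}\left(\|\Delta_{J_0^c}\|_{\ell_1} - \|\Delta_{J_0}\|_{\ell_1}\right)_+ \ge \kappa_{J_0}\|\Delta_{J_0}\|_{\ell_2} - 2\mu_{J_0}\Lambda(\lambda, J_0^c)$$

and thus

$$\|\Delta_{J_0}\|_{\ell_2} \le \frac{1}{\kappa_{J_0}}\|f_\Delta\|_2 + \frac{2\mu_{J_0}}{\kappa_{J_0}}\Lambda(\lambda, J_0^c).$$

We deduce thus

$$\|\Delta\|_{\ell_1} \le 2\|\Delta_{J_0}\|_{\ell_1} + 2\Lambda(\lambda, J_0^c)$$
$$\le 2\sqrt{|J_0|}\,\|\Delta_{J_0}\|_{\ell_2} + 2\Lambda(\lambda, J_0^c)$$
$$\le \frac{2\sqrt{|J_0|}}{\kappa_{J_0}}\|f_\Delta\|_2 + 2\Lambda(\lambda, J_0^c)\left(1 + \frac{2\mu_{J_0}\sqrt{|J_0|}}{\kappa_{J_0}}\right)$$

and then since, for any $\beta > 0$,

$$\frac{8\|\eta_\gamma\|_{\ell_\infty}\sqrt{|J_0|}}{\kappa_{J_0}}\|f_\Delta\|_2 \le \frac{16|J_0|\,\|\eta_\gamma\|_{\ell_\infty}^2}{\kappa_{J_0}^2} + \|f_\Delta\|_2^2$$

and

$$8\|\eta_\gamma\|_{\ell_\infty}\Lambda(\lambda, J_0^c)\left(1 + \frac{2\mu_{J_0}\sqrt{|J_0|}}{\kappa_{J_0}}\right) \le \frac{16|J_0|\,\|\eta_\gamma\|_{\ell_\infty}^2}{\beta} + \frac{\beta\,\Lambda(\lambda, J_0^c)^2}{|J_0|}\left(1 + \frac{2\mu_{J_0}\sqrt{|J_0|}}{\kappa_{J_0}}\right)^2,$$

we have

$$4\|\eta_\gamma\|_{\ell_\infty}\|\Delta\|_{\ell_1} - \|f_\Delta\|_2^2 \le 16|J_0|\left(\frac{1}{\beta} + \frac{1}{\kappa_{J_0}^2}\right)\|\eta_\gamma\|_{\ell_\infty}^2 + \frac{\beta\,\Lambda(\lambda, J_0^c)^2}{|J_0|}\left(1 + \frac{2\mu_{J_0}\sqrt{|J_0|}}{\kappa_{J_0}}\right)^2,$$

which is the result of the theorem.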



6.3 Consequences of Assumptions 1 and 2 

To prove Proposition 1, we establish Lemmas 2 and 3. In the sequel, we consider two integers
$s$ and $l$ such that $1 \le s \le M/2$, $l \ge s$ and $s + l \le M$. We first recall Assumptions 1 and
2. Assumption 1 is stated in a more general form, which allows us to unify the statement of the
subsequent results.

• Assumption 1

$$\phi_{\min}(s + l) > \theta_{l,s+l}.$$

• Assumption 2

$$l\,\phi_{\min}(s + l) > s\,\phi_{\max}(l).$$

In the sequel, we assume that Assumptions 1 and 2 are both true.

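Lemma 2. Let $J_0 \subset \{1, \dots, M\}$ with cardinality $|J_0| = s$ and $\Delta \in \mathbb{R}^M$. We denote by $J_1$ the
subset of $\{1, \dots, M\}$ corresponding to the $l$ largest coordinates of $\Delta$ (in absolute value) outside
$J_0$ and we set $J_{01} = J_0 \cup J_1$. We denote by $P_{J_{01}}$ the projector on the linear space spanned by
$(\varphi_m)_{m \in J_{01}}$. We have:

$$\|P_{J_{01}}f_\Delta\|_2 \ge \sqrt{\phi_{\min}(s + l)}\,\|\Delta_{J_{01}}\|_{\ell_2} - \min(\mu_1, \mu_2)\,\|\Delta_{J_0^c}\|_{\ell_1},$$

with

$$\mu_1 = \frac{\theta_{l,s+l}}{\sqrt{l\,\phi_{\min}(s + l)}} \quad\text{and}\quad \mu_2 = \sqrt{\frac{\phi_{\max}(l)}{l}}.$$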

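Proof. For $k > 1$, we denote by $J_k$ the indices corresponding to the coordinates of $\Delta$ outside
$J_0$ whose absolute values are between the $((k - 1) \times l + 1)$-th and the $(k \times l)$-th largest ones
(in absolute value). Note that this definition is consistent with the definition of $J_1$. Using this
notation, we have

$$\|P_{J_{01}}f_\Delta\|_2 \ge \|P_{J_{01}}f_{\Delta_{J_{01}}}\|_2 - \Big\|\sum_{k \ge 2}P_{J_{01}}f_{\Delta_{J_k}}\Big\|_2$$
$$\ge \|f_{\Delta_{J_{01}}}\|_2 - \sum_{k \ge 2}\|P_{J_{01}}f_{\Delta_{J_k}}\|_2.$$

Since $J_{01}$ has $s + l$ elements, we have

$$\|f_{\Delta_{J_{01}}}\|_2 \ge \sqrt{\phi_{\min}(s + l)}\,\|\Delta_{J_{01}}\|_{\ell_2}.$$

Note that $P_{J_{01}}f_{\Delta_{J_k}} = f_{C_{J_{01}}}$ for some vector $C \in \mathbb{R}^M$. Since

$$\langle P_{J_{01}}f_{\Delta_{J_k}} - f_{\Delta_{J_k}},\ P_{J_{01}}f_{\Delta_{J_k}}\rangle = 0,$$

one obtains that

$$\|P_{J_{01}}f_{\Delta_{J_k}}\|_2^2 = \langle f_{\Delta_{J_k}},\ f_{C_{J_{01}}}\rangle$$

and thus

$$\|P_{J_{01}}f_{\Delta_{J_k}}\|_2^2 \le \theta_{l,s+l}\|\Delta_{J_k}\|_{\ell_2}\|C_{J_{01}}\|_{\ell_2} \le \theta_{l,s+l}\|\Delta_{J_k}\|_{\ell_2}\frac{\|f_{C_{J_{01}}}\|_2}{\sqrt{\phi_{\min}(s + l)}}$$
$$= \frac{\theta_{l,s+l}}{\sqrt{\phi_{\min}(s + l)}}\|\Delta_{J_k}\|_{\ell_2}\|P_{J_{01}}f_{\Delta_{J_k}}\|_2.$$

This implies that

$$\|P_{J_{01}}f_{\Delta_{J_k}}\|_2 \le \frac{\theta_{l,s+l}}{\sqrt{\phi_{\min}(s + l)}}\|\Delta_{J_k}\|_{\ell_2} = \mu_1\sqrt{l}\,\|\Delta_{J_k}\|_{\ell_2}.$$

Moreover, using that $J_k$ has less than $l$ elements, we obtain that

$$\|P_{J_{01}}f_{\Delta_{J_k}}\|_2 \le \|f_{\Delta_{J_k}}\|_2 \le \sqrt{\phi_{\max}(l)}\,\|\Delta_{J_k}\|_{\ell_2} = \mu_2\sqrt{l}\,\|\Delta_{J_k}\|_{\ell_2}.$$

Now using that $\|\Delta_{J_{k+1}}\|_{\ell_2} \le \|\Delta_{J_k}\|_{\ell_1}/\sqrt{l}$, we obtain

$$\sum_{k \ge 2}\|P_{J_{01}}f_{\Delta_{J_k}}\|_2 \le \min(\mu_1, \mu_2)\,\|\Delta_{J_0^c}\|_{\ell_1}$$

and finally

$$\|P_{J_{01}}f_\Delta\|_2 \ge \sqrt{\phi_{\min}(s + l)}\,\|\Delta_{J_{01}}\|_{\ell_2} - \min(\mu_1, \mu_2)\,\|\Delta_{J_0^c}\|_{\ell_1}.$$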
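The block argument in the proof of Lemma 2 rests on the elementary inequality $\|\Delta_{J_{k+1}}\|_{\ell_2} \le \|\Delta_{J_k}\|_{\ell_1}/\sqrt{l}$ for coordinates sorted by decreasing absolute value. The sketch below checks it, together with the resulting bound $\sum_{k \ge 2}\|\Delta_{J_k}\|_{\ell_2} \le \|\Delta_{J_0^c}\|_{\ell_1}/\sqrt{l}$, on a random heavy-tailed vector. The vector `delta_outside` plays the role of the coordinates of $\Delta$ outside $J_0$, and its length and the block size are arbitrary choices.

```python
import numpy as np

def block_norms(delta_outside, block_size):
    """Split the coordinates, sorted by decreasing magnitude, into consecutive
    blocks J_1, J_2, ... and return their l1 and l2 norms."""
    sorted_abs = np.sort(np.abs(delta_outside))[::-1]
    blocks = [sorted_abs[i:i + block_size]
              for i in range(0, sorted_abs.size, block_size)]
    l1_norms = np.array([b.sum() for b in blocks])
    l2_norms = np.array([np.linalg.norm(b) for b in blocks])
    return l1_norms, l2_norms

rng = np.random.default_rng(0)
delta_outside = rng.standard_cauchy(size=37)   # coordinates of Delta outside J_0
l = 5
l1_norms, l2_norms = block_norms(delta_outside, l)

# ||Delta_{J_{k+1}}||_2 <= ||Delta_{J_k}||_1 / sqrt(l) for every k >= 1
consecutive_ok = bool(np.all(l2_norms[1:] <= l1_norms[:-1] / np.sqrt(l) + 1e-12))

# Summing over k >= 2 gives the bound used at the end of the proof.
tail_sum = l2_norms[1:].sum()
tail_bound = np.abs(delta_outside).sum() / np.sqrt(l)
print(consecutive_ok, tail_sum <= tail_bound)
```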
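Lemma 3. We use the same notations as in Lemma 2. For $c > 0$, assume that

$$\|\Delta_{J_0^c}\|_{\ell_1} \le \|\Delta_{J_0}\|_{\ell_1} + c. \qquad (28)$$

Then we have

$$\|P_{J_{01}}f_\Delta\|_2 \ge \max(\kappa_1, \kappa_2)\,\|\Delta_{J_{01}}\|_{\ell_2} - \min(\mu_1, \mu_2)\,c,$$

with

$$\kappa_1 = \sqrt{\phi_{\min}(s + l)}\left(1 - \frac{\theta_{l,s+l}}{\phi_{\min}(s + l)}\sqrt{\frac{s}{l}}\right) \quad\text{and}\quad \kappa_2 = \sqrt{\phi_{\min}(s + l)}\left(1 - \sqrt{\frac{s\,\phi_{\max}(l)}{l\,\phi_{\min}(s + l)}}\right).$$

Proof. Using Lemma 2 and (28), we obtain that

$$\|P_{J_{01}}f_\Delta\|_2 \ge \sqrt{\phi_{\min}(s + l)}\,\|\Delta_{J_{01}}\|_{\ell_2} - \min(\mu_1, \mu_2)\left(\|\Delta_{J_0}\|_{\ell_1} + c\right).$$

Using $\|\Delta_{J_0}\|_{\ell_1} \le \sqrt{s}\,\|\Delta_{J_0}\|_{\ell_2}$, we deduce that

$$\|P_{J_{01}}f_\Delta\|_2 \ge \left(\sqrt{\phi_{\min}(s + l)} - \sqrt{s}\min(\mu_1, \mu_2)\right)\|\Delta_{J_{01}}\|_{\ell_2} - c\min(\mu_1, \mu_2)$$
$$\ge \max(\kappa_1, \kappa_2)\,\|\Delta_{J_{01}}\|_{\ell_2} - c\min(\mu_1, \mu_2).$$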


6.4 Proof of Theorem 5

The dictionary considered here is the Haar dictionary $(\phi_{jk})_{j,k}$ and is double-indexed. As a
consequence, in the following, the quantity /^ojfe, Pjk^ '^o jk V-yjk, cf'^k ^% defined as in 
©) 10), (01), {SJ, ([6]) and ([7]) where (^m is replaced by (t)jk- Note that, since /o = l[o.i]! we have, 
for $j \ne -1$, $\beta_{0,jk} = 0$ and for any $j$, $\sigma_{0,jk}^2 = 1$ if $k \in \{0, \dots, 2^j - 1\}$ and $0$ otherwise.



The proof of (16) is provided by using the oracle inequality satisfied by hard thresholding
given by Theorem 1 of [27] and the rough control of the soft thresholding estimate by the hard 
one: 

An alternative is directly obtained by adapting the oracle results derived for soft thresholding 
rules in the regression model considered by Donoho and Johnstone [iBj . 
To prove (17), we establish the following lemma.



Lemma 4. Let $\gamma < 1$. We consider $j \in \mathbb{N}$ such that

$$\frac{n}{(\log n)^\alpha} \le 2^j < \frac{2n}{(\log n)^\alpha} \qquad (29)$$

for some $\alpha > 1$. Then for all $\varepsilon > 0$ such that $\gamma + 2\varepsilon < 1$,


2^-1 
k=0 



^ ^^^^i^(logn)--n-(^+-)(l + o„(l)). 
Then, we use the following inequality. For $j$ that satisfies (29), we have for $r > 0$,



-foil) > Y.^U\M-V,,k) l|^,.,|>,,,,, 

fe=0 ^ ' 

/ 2 \ 

fc=0 ^ ^ 

^ 2 2^ -1 



fe=0 
2 2^ -1 



fc=0 



So, if $r$ and $\varepsilon$ are such that $(1 + r)^2\gamma + 2\varepsilon < 1$, then applying Lemma 4, Inequality (17) is proved
for any $\delta$ such that $(1 + r)^2\gamma + 2\varepsilon < \delta < 1$.

Proof of Lemma 4. Let $j$ that satisfies (29) and $0 \le k \le 2^j - 1$. We have



So, for any < e < ^ < i. 



Now, 



^2 log"- 2||(/)j-^fc||oo7l0g't 



< J27iHi!!f(l + .).?, + 27||0,.||L^(.-+4)^ ■ 2||0..|U7logn 



Furthermore, we have 



2 



where $s_{njk}$ and $u_{njk}$ are defined as in (19) with $\varphi_m$ replaced by $\phi_{jk}$. This implies that



,,,, < j27(l + s)i^.„,.. + ./27(l + .)i^ X -A^K.I + M^^ (1 + VT 

Using (21), with probability larger than $1 - 6n^{-2}$, we have

$$|u_{njk}| \le U(2\log n),$$

and, since $\sigma_{0,jk}^2 = 1$,



n{n — 1) 



3 

[/(21ogn) < £lv^+^logn + C3||</.„fc||^(^)%c4||</-„.||^(^) 



2 



where $c_1$, $c_2$, $c_3$, $c_4$, $C_1$ and $C_2$ are universal constants. Finally, with probability larger than
$1 - 6n^{-2}$, we obtain that



y n n(n — 1) n \ n J 

So, since $\gamma < 1$, there exists $w(\varepsilon)$, only depending on $\varepsilon$, such that with probability larger than
$1 - 6n^{-2}$,



We set 



,„ , logn , ,22logn 

^7Jfc = \/27(l h w(£)- 



n n 



so T/7,jfe < ??7,jfe- Then, we have 

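$$\hat\beta_{jk} = \frac{1}{n}\sum_{i=1}^n\phi_{jk}(X_i) = \frac{2^{j/2}}{n}\sum_{i=1}^n\left(1_{X_i \in [k2^{-j},(k+0.5)2^{-j}[} - 1_{X_i \in [(k+0.5)2^{-j},(k+1)2^{-j}[}\right) = \frac{2^{j/2}}{n}\left(N_{jk}^+ - N_{jk}^-\right),$$

with

$$N_{jk}^+ = \sum_{i=1}^n 1_{X_i \in [k2^{-j},(k+0.5)2^{-j}[}, \qquad N_{jk}^- = \sum_{i=1}^n 1_{X_i \in [(k+0.5)2^{-j},(k+1)2^{-j}[}.$$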


We consider $j$ such that

$$\frac{n}{(\log n)^\alpha} \le 2^j < \frac{2n}{(\log n)^\alpha}, \qquad \alpha > 1.$$

In particular, we have

$$\frac{(\log n)^\alpha}{2} < n2^{-j} \le (\log n)^\alpha.$$

Now, we can write



1 2i 

i=l 
that implies that

2^-1 



' =0 

2^-1 

> X ^ (^lfci|/3jfci>>j:;:7;i|"njfel<i^(2iogn)) 

fc=0 

k=0 ^ ^ 

- E ((^ife ~ ^jfe)^ 1 1 JV+ - JV- I > ^27(l+e) (JV+ +JV- ) log n+«;(e) log n ^ I = I < C^(21og") ^ 

- 7^2"^ (^^^^1 ~ ^^'1 ' I <i - JVri I > ^27(1+^) (iV+ +iV- ) log log n ^ I ""^'=1 < £^(21ogn) ^ 



Now, we consider a bounded sequence (wn)n such that for any n, Wn > ^(e) and such that 
is an integer with 



47(1 + £)/i„j log(n) + Wn log(n) 



and $\tilde\mu_{nj}$ is the largest integer smaller or equal to $n2^{-j-1}$. We have

^^nj ~ 47(1 + £)/x„j log n 



since 



Now, set 



(logn)" 



- 1 < n2-^-i - 1 < linj < n2-^-^ < 



1 ^ (logn)" 



that are positive for $n$ large enough. If $N_{jk}^+ = l_{nj}$ and $N_{jk}^- = m_{nj}$ then we have $N_{jk}^+ - N_{jk}^- = \sqrt{v_{nj}}$.
Finally, we obtain that



2^-1 



k=0 
2^3 



> (log n)-2" [p (Ar+ = . TV- = _ p (^lunjk I > t/(21ogn))] 



-2a 



> Vnj (log n) 
>t;„j(logn)-^" X 



p2/in,(i_2p^.)n-2A„j _ 6 



(30) 
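where

$$p_j = \int 1_{[k2^{-j},(k+0.5)2^{-j}[}(x)f_0(x)\,dx = \int 1_{[(k+0.5)2^{-j},(k+1)2^{-j}[}(x)f_0(x)\,dx = 2^{-j-1}.$$

Now, let us study each term of (30). We have

$$p_j^{2\tilde\mu_{nj}} = \exp\left(2\tilde\mu_{nj}\log(p_j)\right) = \exp\left(2\tilde\mu_{nj}\log(2^{-j-1})\right),$$

$$(1 - 2p_j)^{n - 2\tilde\mu_{nj}} = \exp\left((n - 2\tilde\mu_{nj})\log(1 - 2p_j)\right)$$
$$= \exp\left(-(n - 2\tilde\mu_{nj})2^{-j} + o_n(1)\right)$$
$$= \exp\left(-n2^{-j}\right)(1 + o_n(1)),$$

and

$$(n - 2\tilde\mu_{nj})^{n - 2\tilde\mu_{nj}} = \exp\left((n - 2\tilde\mu_{nj})\log(n - 2\tilde\mu_{nj})\right)$$
$$= \exp\left((n - 2\tilde\mu_{nj})\left(\log n + \log\left(1 - \frac{2\tilde\mu_{nj}}{n}\right)\right)\right)$$
$$= \exp\left((n - 2\tilde\mu_{nj})\log n - 2\tilde\mu_{nj}\right)(1 + o_n(1))$$
$$= \exp\left(n\log n - 2\tilde\mu_{nj} - 2\tilde\mu_{nj}\log n\right)(1 + o_n(1)).$$

Then, using the Stirling relation, $n! = n^ne^{-n}\sqrt{2\pi n}\,(1 + o_n(1))$, we deduce that

$$\frac{n!}{(n - 2\tilde\mu_{nj})!}\,p_j^{2\tilde\mu_{nj}}(1 - 2p_j)^{n - 2\tilde\mu_{nj}} = \frac{e^{-n}n^n}{e^{-(n - 2\tilde\mu_{nj})}(n - 2\tilde\mu_{nj})^{n - 2\tilde\mu_{nj}}}\,p_j^{2\tilde\mu_{nj}}(1 - 2p_j)^{n - 2\tilde\mu_{nj}} \times (1 + o_n(1))$$
$$= \exp(-2\tilde\mu_{nj}) \times \frac{\exp(n\log n)}{(n - 2\tilde\mu_{nj})^{n - 2\tilde\mu_{nj}}}\,p_j^{2\tilde\mu_{nj}}(1 - 2p_j)^{n - 2\tilde\mu_{nj}} \times (1 + o_n(1))$$
$$= \exp(-2\tilde\mu_{nj}) \times \frac{\exp\left(n\log n + 2\tilde\mu_{nj}\log(2^{-j-1}) - n2^{-j}\right)}{\exp\left(n\log n - 2\tilde\mu_{nj} - 2\tilde\mu_{nj}\log n\right)}(1 + o_n(1))$$
$$= \exp\left(2\tilde\mu_{nj}\log n + 2\tilde\mu_{nj}\log(2^{-j-1}) - n2^{-j}\right)(1 + o_n(1)).$$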



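It remains to evaluate $l_{nj}! \times m_{nj}!$:

$$l_{nj}! \times m_{nj}! = \left(\frac{l_{nj}}{e}\right)^{l_{nj}}\left(\frac{m_{nj}}{e}\right)^{m_{nj}}\sqrt{2\pi l_{nj}}\sqrt{2\pi m_{nj}}\,(1 + o_n(1))$$
$$= \exp\left(l_{nj}\log l_{nj} + m_{nj}\log m_{nj} - 2\tilde\mu_{nj}\right) \times 2\pi\tilde\mu_{nj}(1 + o_n(1)).$$

If we set

$$x_{nj} = \frac{\sqrt{v_{nj}}}{2\tilde\mu_{nj}} = o_n(1),$$

then

$$l_{nj} = \tilde\mu_{nj}(1 + x_{nj}), \qquad m_{nj} = \tilde\mu_{nj}(1 - x_{nj}),$$

and using that

$$(1 + x_{nj})\log(1 + x_{nj}) = (1 + x_{nj})\left(x_{nj} - \frac{x_{nj}^2}{2} + \frac{x_{nj}^3}{3} + O(x_{nj}^4)\right)$$
$$= x_{nj} - \frac{x_{nj}^2}{2} + \frac{x_{nj}^3}{3} + x_{nj}^2 - \frac{x_{nj}^3}{2} + O(x_{nj}^4)$$
$$= x_{nj} + \frac{x_{nj}^2}{2} - \frac{x_{nj}^3}{6} + O(x_{nj}^4),$$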

we obtain that 



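$$l_{nj}\log l_{nj} = \tilde\mu_{nj}(1 + x_{nj})\log\left(\tilde\mu_{nj}(1 + x_{nj})\right)$$
$$= \tilde\mu_{nj}(1 + x_{nj})\log(1 + x_{nj}) + \tilde\mu_{nj}(1 + x_{nj})\log(\tilde\mu_{nj})$$
$$= \tilde\mu_{nj}\left(x_{nj} + \frac{x_{nj}^2}{2} - \frac{x_{nj}^3}{6} + O(x_{nj}^4)\right) + \tilde\mu_{nj}(1 + x_{nj})\log(\tilde\mu_{nj}).$$

Similarly, we obtain that

$$m_{nj}\log m_{nj} = \tilde\mu_{nj}\left(-x_{nj} + \frac{x_{nj}^2}{2} + \frac{x_{nj}^3}{6} + O(x_{nj}^4)\right) + \tilde\mu_{nj}(1 - x_{nj})\log(\tilde\mu_{nj}),$$

that implies that

$$l_{nj}\log l_{nj} + m_{nj}\log m_{nj} = \tilde\mu_{nj}\left(x_{nj}^2 + O(x_{nj}^4)\right) + 2\tilde\mu_{nj}\log(\tilde\mu_{nj})$$
$$\le \tilde\mu_{nj}x_{nj}^2 + 2\tilde\mu_{nj}\log(n2^{-j-1}) + O\left(\tilde\mu_{nj}x_{nj}^4\right).$$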
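The expansion $l_{nj}\log l_{nj} + m_{nj}\log m_{nj} - 2\tilde\mu_{nj}\log\tilde\mu_{nj} = \tilde\mu_{nj}(x_{nj}^2 + O(x_{nj}^4))$ can be checked numerically. The sketch below uses generic names `mu` and `x` for $\tilde\mu_{nj}$ and $x_{nj}$, and the numerical values are arbitrary illustrations.

```python
import math

def entropy_gap(mu, x):
    """l log l + m log m - 2 mu log mu with l = mu (1 + x), m = mu (1 - x)."""
    l = mu * (1 + x)
    m = mu * (1 - x)
    return l * math.log(l) + m * math.log(m) - 2 * mu * math.log(mu)

mu, x = 1000.0, 0.05
gap = entropy_gap(mu, x)
leading_term = mu * x**2
remainder = gap - leading_term      # expected to be of order mu * x**4
print(gap, leading_term, remainder)
```

The remainder is positive and of order $\tilde\mu_{nj}x_{nj}^4$, which is the term absorbed by the slack between $\gamma(1 + \varepsilon)$ and $\gamma + 2\varepsilon$ in the next step.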



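Since

$$\tilde\mu_{nj}x_{nj}^2 = \frac{v_{nj}}{4\tilde\mu_{nj}} \sim \gamma(1 + \varepsilon)\log n,$$

we have, for $n$ large enough,

$$\tilde\mu_{nj}x_{nj}^2 + O\left(\tilde\mu_{nj}x_{nj}^4\right) \le (\gamma + 2\varepsilon)\log n$$

and

$$l_{nj}\log l_{nj} + m_{nj}\log m_{nj} \le (\gamma + 2\varepsilon)\log n + 2\tilde\mu_{nj}\log(n2^{-j-1}).$$

Finally, we have

$$l_{nj}! \times m_{nj}! = \exp\left(l_{nj}\log l_{nj} + m_{nj}\log m_{nj} - 2\tilde\mu_{nj}\right) \times 2\pi\tilde\mu_{nj}(1 + o_n(1))$$
$$\le \exp\left((\gamma + 2\varepsilon)\log n + 2\tilde\mu_{nj}\log(n2^{-j-1}) - 2\tilde\mu_{nj}\right) \times 2\pi\tilde\mu_{nj}(1 + o_n(1)).$$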
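The Stirling-type equivalent used for $l_{nj}! \times m_{nj}!$ can be compared with the exact value on the logarithmic scale. The sketch below relies on `math.lgamma` for the exact log-factorials; the integers `l` and `m` are arbitrary illustrative values with $l + m = 2\tilde\mu_{nj}$ and a small relative gap, as in the regime $x_{nj} = o_n(1)$.

```python
import math

def log_factorial_product(l, m):
    """Exact log(l! * m!) via the log-gamma function."""
    return math.lgamma(l + 1) + math.lgamma(m + 1)

def log_stirling_equivalent(l, m):
    """log of exp(l log l + m log m - 2 mu) * 2 pi mu, with mu = (l + m) / 2."""
    mu = (l + m) / 2
    return l * math.log(l) + m * math.log(m) - 2 * mu + math.log(2 * math.pi * mu)

l, m = 1_001_000, 999_000            # mu = 10**6, relative gap x = 10**-3
log_ratio = log_factorial_product(l, m) - log_stirling_equivalent(l, m)
print(log_ratio)                     # close to 0, i.e. ratio 1 + o(1)

small_log_ratio = log_factorial_product(12, 8) - log_stirling_equivalent(12, 8)
print(small_log_ratio)               # larger in absolute value for small arguments
```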

Since < £ < < i , we conclude that there exists 6 < 1 such that 



2^-1 



fe=0 



>t;„,(logn)-2" 
t;„j(logn)~^" 

2'Kjlnj 

27(l + £)e-2 



exp (2/j,,y logn + 2 jlnj log (2 ^ ^) - n2 ^) 



> 



6 

exp ((7 + 2£) logn + 2jlnj log(n2--?-i) - 2jinj) x 27r/i„j 

6 



(l + On(l)) 



exp (- (7 + 2£) log n - 2) 



> 



(logn)i-2"n-(''+2^)(l + o„(l)) 
and Lemma 4 is proved.